Chapter 5, 6: Storage and File Organizations, Indexes, RAID PowerPoint PPT Presentation

presentation player overlay
1 / 41
About This Presentation
Transcript and Presenter's Notes

Title: Chapter 5, 6: Storage and File Organizations, Indexes, RAID


1
Chapter 5, 6 Storage and File Organizations,
Indexes, RAID
  • ICS 434 Reading Material Summary
  • October 1, 2003

2
Physical Database Design .. 1.. 1
  • Physical database design is the process of
    choosing specific storage structures and access
    paths for the database files to achieve good
    performance for the various database
    applications.
  • Each DBMS offers a variety of options for file
    organization and access paths
  • Once a specific DBMS is chosen, the physical
    database design process is restricted to choosing
    the most appropriate file organizations and
    access paths for the database files from among
    the options available in the DBMS.

3
Physical Database Design .. 2
  • We know that data in a database systems are
    stored on a secondary storage using media like
    magnetic disks, optical disks, or the like.
    Magnetic tapes are normally used for keeping the
    backup of the databases.
  • The data in the databases are stored using the
    storage structure normally known as file
    organization
  • Some very common file organizations are
    sequential (unordered or unsorted), sequential
    (Ordered or sorted), indexed and hashed

4
Physical Database Design .. 3
  • Each file organization has its own very specific
    characteristics or properties
  • For example, if data is stored in a sequential
    (unordered) file, we cannot perform binary search
    on the file. The data access will be sequential
    and the average retrieval time of a data item in
    the file will directly depend on the size (i.e.,
    number of records or tuples) of file
  • If many records are added to the file and few are
    deleted from the file, the retrieval time of a
    data item from such a file will go on increasing
    and ultimately the response time of the query
    using the data item will become worse and worse.
    If we dont take care of the problem properly,
    the database performance may become unacceptable.

5
Processing a Database Command
  • Important steps in writing data to a database or
    reading data from a database are
  • 1. User submits a data request command to the
    DBMS
  • 2. DBMS analyses the command
  • 3. DBMS generates a database access command that
    should be executed to fulfill the data request of
    the user
  • 4. As a part of the database access, the relevant
    files of the database are opened and blocks of
    data are transferred from the secondary storage
    to a main memory storage area called buffers
  • 5. The data from the buffers is transferred to
    the program work area and manipulated in the work
    area
  • 6. The data from the program work area is
    transferred to the user if the processing is
    complete for retrieval or it is transferred to
    the buffers to continue the processing
  • 7. In case of writing the data to the database,
    the buffers are written back to the storage

6
Selecting File Organization and Access Paths
  • Criteria for choosing appropriate file
    organization and access paths are
  • Response time --- is the time from the submission
    of a database query for execution to receiving a
    response. The response time of a query mainly
    depends on the access time for data items
    referenced in the query
  • Space utilization --- is the amount of storage
    space used by the database files and their access
    path structures like indexes on disk
  • Transaction throughput --- is the average number
    of transactions that can be process per unit time

7
Issues in Physical Database Design
  • In a relational database, each base relation in
    the database schema is stored as a physical
    database file.
  • During the physical database design many design
    decisions need to be made, such as
  • What file organization should be used to store
    the relation?
  • Whether the file should be indexed or not?
  • If indexed then which attribute or attributes
    should be used to create index(s)?
  • Should the index be clustered or not?
  • Should hashing or tree index be used?
  • If it is hashing then should it be static or
    dynamic hashing?

8
File Organizations and Indexes
  • File organizations
  • Heap (Unordered sequential)
  • Ordered sequential
  • Hashed Files
  • Index Structures for Files
  • What is an index?
  • Single-level VS multiple-level indexes
  • Types of single-level indexes
  • Primary index
  • Clustering index
  • Secondary indexes
  • Dynamic multi-level indexes using B, B trees
  •  

9
File Organizations - Heap files
  • Files of unordered records -- records are placed
    in the file in the order in which they are
    inserted.
  • Access
  • Only linear search is possible
  • Insertion
  • A new record is inserted at the end of the file
  • Deletion --- two methods to delete a record from
    a heap file are immediate physical deletion and
    delayed physical deletion
  • In immediate physical deletion method, record to
    be deleted is searched, the disk block containing
    the record is transferred to the buffer, the
    record is deleted, the records following the
    deleted record are adjusted to use the freed
    space, and the block is written back to the disk.
  • In delayed physical deletion method (also known
    as Deletion marker method), the record to be
    deleted is searched if found then it is marked
    as deleted. After some time when the file is
    reorganized then during the reorganization, the
    records marked as deleted are physically
    deleted from the file and freed space is used by
    the other records.

10
File Organizations - Sequential Ordered Files
  • The records of a file on disk are physically
    ordered based on the values of one field called
    the ordering field.
  • Access
  • Various faster search algorithms like binary
    search can be performed.
  • Insertion
  • A new record is inserted at its proper position
    according to the value of the ordering field and
    the other records are adjusted accordingly.
  • Deletion
  • Whenever a record is deleted, the remaining
    records must maintain their order.

11
File Organizations Hash Files
  • A hash file is a file in which a hash function
    uses one key field (also called the hash field)
    value to generate the address of the record in
    the storage, i.e., h ( hash field value)
    addresswhere h is the hash function
  • Examplesuppose we have 1000 records of data
    about employees and each employee is assigned a
    unique 5-digit employee id number. We can use
    mod(employee-id, 1000) as a hash function. The
    function will give us values between 0 and 999
    which can be used as an address of the record in
    the file.

12
File Organizations - Hash Files
  • Insertion, deletion, and search operations on a
    hash file
  • In order to insert a record into the file, the
    hash function is applied onto the key value of
    the record. The record is stored at the address
    generated by the hash function.
  • In order to delete a record from the file, the
    hash function is applied on the key value and the
    record from the address generated by the hash
    function is deleted from the file.
  • In order to search a record from the file, the
    hash function is applied on the key value and the
    record from the address generated by the hash
    function is retrieved from the file.
  • The hash file organization works very fine if the
    hash function can generate a unique address for
    every possible key value in the file. If this is
    not the case, then hash function may generate the
    same address for two or more key values.

13
File Organizations Hash Files
  • A collision occurs when the hash field value of a
    record that is being inserted hashes to an
    address that already contains a different record.
    The process of finding another position to store
    the record is called Collision Resolution
  • There are various methods for collision
    resolution open addressing, chaining, and
    multiple hashing are just three to mention
  • Hashing when applied to store and retrieve file
    from the disks is called external hashing.

14
Index Structures for Files
  • An index is an access structure created to make
    the access to the data in the data file faster.
  • The field of the records in the file on which an
    index is created is called the indexing field.
  • The index structure provide secondary access path
  • Various types of indexes are primary index,
    secondary index, dense index, sparse index, and
    clustering index.
  • Single-level VS multi-level indexes If the size
    of the data file is such that an index structure
    of reasonable size can be created and is
    sufficient to provide an efficient access to the
    data in the file then it is a single-level index.
    But if the size of the index gets very large then
    another index can be created on to the first
    index. Such a scheme is called the multi-level
    index structure.

15
Index Structures for Files
  • Primary index
  • A primary index is an ordered file whose records
    are of fixed length with two fields. The first
    field is of the same type as the ordering key
    field, called the primary key of the data file,
    and the second field is a pointer to a disk block
    .
  • Secondary indexes
  • In addition to the primary index, which can be
    only one for the file, one or more indexes can be
    created on the fields other than the primary key.
    Such indexes are called the secondary indexes.
  • Clustering index
  • If a secondary index is created on a field that
    is not a key then in this index each entry may
    point to many records in the file. Such an index
    is called the clustering index.

16
Index Structures for Files
  • Dense index
  • A dense index is an index in which there is one
    entry for every search key value in the data
    file.
  • Sparse index
  • A sparse index is an index which has entries for
    only some of the search values. It is also called
    non-dense index.

17
Search Trees
  • We can use a search tree as a mechanism to search
    for records stored in a disk file. The values in
    the tree can be the values of one of the fields
    of the file, called the search field
  • Each key value in the tree is associated with a
    pointer to the record in the data file having
    that value. Alternatively, the pointer could be
    to the disk block containing that record.
  • The search tree itself can be stored on disk by
    assigning each tree node to a disk block.
  • When a new record is inserted, the search tree is
    updated by inserting an entry in the tree
    containing the search field value of the new
    record and a pointer to the new record.

18
Search Trees
  • Algorithms are necessary for inserting and
    deleting search values into and from the search
    tree while maintaining the preceding two
    constraints. In general, these algorithms do not
    guarantee that a search tree is balanced, meaning
    that all of its leaf nodes are at the same level.
  • Keeping a search tree balanced is important
    because it guarantees that no nodes will be at
    very high levels and hence require many block
    accesses during a tree search.
  • Another problem with search trees is that record
    deletion may leave some nodes in the tree nearly
    empty, thus wasting storage space and increasing
    the number of levels.

19
Search Tree Example
B-Tree Example
20
B-Trees
  • The B-tree addresses problems of search trees by
    specifying additional constraints on the search
    tree.
  • The B-tree has additional constraints that ensure
    that the tree is always balanced and that the
    space wasted by deletion, if any, never becomes
    excessive.
  • The algorithms for insertion and deletion,
    though, become more complex in order to maintain
    these constraints. Most insertions and deletions
    are simple processes they become complicated
    only under special circumstancesnamely, whenever
    we attempt an insertion into a node that is
    already full or a deletion from a node that makes
    it less than half full.

21
B-Tree Example
22
B-Trees
  • In a B-tree, every value of the search field
    appears once at some level in the tree, along
    with a data pointer. In a B-tree, data pointers
    are stored only at the leaf nodes of the tree
    hence, the structure of leaf nodes differs from
    the structure of internal nodes.
  • The leaf nodes have an entry for every value of
    the search field, along with a data pointer to
    the record (or to the block that contains this
    record) if the search field is a key field.
  • For a nonkey search field, the pointer points to
    a block containing pointers to the data file
    records, creating an extra level of indirection
  • The leaf nodes of the B-tree are usually linked
    together to provide ordered access on the search
    field to the records. These leaf nodes are
    similar to the first (base) level of an index.

23
Most implementations of Dynamic Multilevel index
use B-tree
24
B-Tree Example (Figure 6.12/page 181)
25
Physical Database Design Process
  • After EER design, by mapping it to the relational
    data model, we have the logical schema for the
    database.
  • The next phase in the database design process is
    physical database design.
  • Physical database design is the process in which
    the designer should come up with the appropriate
    structures for the data storage on secondary
    storage devices such that it guarantees good
    performance of the database.
  • Most relational systems store each base relation
    as a physical database file.

26
Physical Database Design Process
  • During the physical database design phase we
  • Choose appropriate file structure for each
    relation in the database schema
  • Decide on if some indexes need to be created to
    make the access to the data efficient
  • Ensure that performance goals of the database
    will be met.

27
Inputs for the Physical Database Design
  • The two main inputs for the physical database
    design are
  • The database workload
  • Users performance requirements
  • Database workload includes
  • A list of queries and their frequencies
  • A list of updates and their frequencies
  • Performance goals for each type of query and
    update

28
Inputs for the Physical database Design
  • For each query in the workload, we must identify
  • Which relations are accessed
  • Which attributes are retained (in the SELECT
    clause)
  • Which attributes have selection or join
    conditions expressed on them (in the WHERE
    clause) and how selective these conditions are
    likely to be
  • For each update in the workload, we must identify
  • Which attributes have selection or join
    conditions expressed on them (in the WHERE
    clause) and how selective these conditions are
    likely to be
  • The type of update (INSERT, DELETE, UPDATE)
  • Relations and attributes that are modified by
    each update

29
Outputs from the Physical database Design
  • The physical database design decisions include
  • Specify the type of file for each relation in the
    database schema
  • The attributes on which indexes should be created
  • Physical design decisions for indexing should
    include
  • What attribute or attributes to index on?
  • Whether to set up a clustered index?
  • Whether to use hash index or a tree index?
  • Which relations to index and which field or
    combination of fields to choose as index search
    keys?
  • For each index, should it be clustered or
    non-clustered?

30
Guidelines For Index Selection
  • Guideline 1Dont build an index unless some
    query benefit from it.Whenever possible, choose
    indexes that speed up more than one query
  • Guideline 2 Attributes mentioned in a WHERE
    clause are candidates for indexing.An
    exact-match selection condition suggests that the
    attribute be considered for indexing.A range
    selection condition suggests that B tree index
    be considered for the selected attributes

31
Guidelines For Index Selection
  • Guideline 3
  • Indexes with multiple-attribute search keys
    should be considered in the following two
    situations
  • A WHERE clause includes conditions on more than
    one attribute of a relation
  • The attributes enable index-only evaluation
    strategies for important queries, i.e., only the
    index will be accessed and accessing the relation
    can be avoided
  • Guideline 4
  • At most one index on a given relation can be
    clustered, and clustering affects the performance
    greatly so the choice of clustering index is
    important. If several range queries are posed on
    a relation involving different set of attributes,
    then these attributes are good candidates for
    clustered indexes.

32
Guidelines For Index Selection
  • Guideline 5
  • A B index is preferable because it supports
    range queries as well as equality queries.
  • Guideline 6
  • After preparing the initial list of indexes to
    create, consider the impact of each index on the
    updates in the workload. If maintaining an index
    slows down frequent update operations, consider
    dropping the index.

33
RAID Levels
  • What is RAID?
  • Data Striping
  • Reliability and performance improvements
  • RAID organization and levels
  •  

34
What is RAID?
  • RAID stands for Redundant Array of Inexpensive
    Disks
  • Rather than storing data on a single set of
    disks, the data is stored on one or more set(s)
    of independent disks which behave like a single
    high performance logical disk.
  • RAID technology comes with seven levels, I.e.,
    level 0, level 1, level 2, level 3, level 4,
    level 5, and level 6.
  • The number of redundant disks depends on the RAID
    level used to store the data, for example RAID
    level 0 has no redundant data (therefore no
    additional disks are used. RAID level 0 is same
    as traditional storage of data. RAID level 1 uses
    two sets of disks rather than one. The second set
    of disks is used to mirror the data, and so on.

35
Data Striping
  • What is data striping?
  • Data striping distributes data over multiple
    disks to make them appear as a single large, fast
    disk.The following diagram shows that FILE A is
    striped into four disks.

36
Data Striping
  • Data Striping uses parallelism to improve the
    disk performance. In the diagram on slide 4,
    suppose the FILE A consists of four disk
    blocks. Now if the FILE A is stored on one disk
    and we want to read the whole file, then we need
    four read operations to read the file. But if the
    file is stored on four independent disks, one
    block on each disk, and the four disks behave
    like a single logical disk then all four blocks
    can be read in parallel and it will take time
    equal to one read operation
  • Because data is striped on to multiple disks,
    therefore some kind of parity check can be used
    to improve the reliability of data very similar
    to the parity bit used in the main memory.

37
Reliability and Performance Improvements
  • Improving reliability with RAID
  • One way to increase reliability of a database
    when using redundant array of independent disks
    by introducing redundancy, i.e., store one data
    item I more than one place.
  • One technique for introducing redundancy is
    called mirroring or shadowing, i.e. store data
    redundantly on two identical physical disks that
    are treated as one logical disk.
  • In case of mirrored data, a data item can be read
    from any disk but for writing the data item must
    be written o both.
  • If one disk fails, the other disk is still there
    to continuously provide the data. It improves he
    reliability.

38
Reliability and Performance Improvements
  • Improving performance with RAID
  • Assume data striping at block level is used and
    rather than storing data on one disk, it is
    stored on four disks. Reading one whole file in
    the later case will take one-fourth of the time
    needed to read the whole file in the former case.
  • Granularity ( the size of data used for
    stripping) cab reduced to even improve the
    performance more. For example, by using 8-disks,
    bit-level data stripping can be used. In this
    case, one byte of data will be read from 8-disks
    (one bit from each disk) in parallel.

39
RAID Organization and Levels
  • RAID levels depend on the data redundancy
    introduced and correctness checking technique
    used in the scheme.
  • Level 0 no redundancy and no correctness
    checking
  • Level 1 redundancy through mirroring and no
    correctness checking
  • Level 2 Mirroring or no mirroring combined with
    memory like correctness checking for example
    using parity bit. Various versions of level 2 are
    possible
  • Level 3 like level 2 but uses single disk for
    parity
  • Level 4 block level data striping and parity
    like level 3
  • Level 5 block level data striping but data and
    parity are distributed across all disks
  • Level 6 just two redundant disks with PQ
    redundancy scheme.

40
RAID Organization and Levels
41
RAID Organization and Levels
Write a Comment
User Comments (0)
About PowerShow.com