External Storage - PowerPoint PPT Presentation

1 / 37
About This Presentation
Title:

External Storage

Description:

External Storage Chapter 10 Plus Considerable Instructor Notes from Other Sources. * * * Index Sequential more =- VSAM!! So, an indexed sequential is ... – PowerPoint PPT presentation

Number of Views:123
Avg rating:3.0/5.0
Slides: 38
Provided by: unfEdubr
Learn more at: https://www.unf.edu
Category:
Tags: external | storage

less

Transcript and Presenter's Notes

Title: External Storage


1
External Storage
  • Chapter 10
  • Plus Considerable Instructor Notes from Other
    Sources.

2
Introduction External Storage
  • One of the most important topics we will discuss.
  • We know about multi-way trees, such as 2-3-4
    Trees, and these concepts generalize.
  • But when it comes to external storage, there are
    additional major considerations we must consider.

3
Introduction External Storage
  • External Storage organization and access (not the
    same!!!)
  • External Storage organization and retrieval
    are much different than those techniques applied
    to primary memory data structures and
    techniques!!!

4
Accessing External DataA Very Different
Ballgame!
  • The data structures so far entirely in RAM (or
    primary memory)
  • RAM random access memory is many orders of
    magnitude quicker than random access disk
    access. MANY ORDERS OF MAGNITUDE!! .
  • RAM is very limiting indeed! Data is NOT
    persistent!
  • Amount of data limited by size of primary memory
    typically shared with many other concurrent
    users
  • Data, of course, is volatile.
  • Advantage of Data Structures in RAM
  • For data structures, the Speed of RAM is very
    enticing. But for real IS / IT applications,
    many applications require external storage in the
    form of a database system or a file system.
  • Advantages of RAM for in-memory data structures
    are gone!
  • Disadvantage
  • Persistence, number of data items (records) etc.
  • Amount of memory available to your process
    (program)
  • Significant IS applications in real world very
    often require lots of data to process.
  • External disk files - only practical way.
  • ? Disk processing is really the mainstay of
    information processing whether the files are
    legacy type, database files, etc

5
Accessing External Data Very Slow Access
  • RAM Access any location (byte addressable) in
    memory accessed equally fast hence RAM.
  • Disks
  • Data is organized along circular (concentric)
    circles
  • Tracks 000 to track 2999 for example (outer to
    inner)
  • Disks come in various architectures
  • Some have different storage capacities in
    different tracks.
  • Called constant angular velocity and constant
    linear velocity disks! LOOK UP!
  • Traditional disks (CAV) have same number of
    bits/track different densities! (Velocity of
    outer tracks is much greater than inner tracks)
  • Disk organization
  • Concentric platters, access arm(s), rotational
    speeds, seek time, data transfer times, etc.
  • Disk Controllers small computers executing disk
    commands shift registers, etc.

6
Disk Access Four Key Components
  • 1. Seek time
  • Movement of access arm to correct cylinder
  • The most significant factor in access
  • May take 20 msec
  • 2. Head Select electronic speeds negligible
    fastest component
  • 3. Rotational Delay half speed of rotation
    maybe 5-10 msec
  • 4. Data Transfer pretty quick too
  • Tracks divided into fixed-sized sectors. Sector
    size predetermined at factory. (Some are marked
    as defective.)
  • Disk access times of 10 or less msec are common.
  • Important to note disk access times are being
    reduced all the time, but so are main memory
    access time and at a faster rate.
  • ? Book points out that the disparity continues
    to grow!

7
Sequential File Organization and Access
  • Looking for a specific record in a sequential
    file
  • While one can search for a specific record in a
    sequential file, for a large file, this is very
    painful.
  • This is called sequential access.
  • Individual record searches are NOT practical, but
    can be done.
  • Entire file might need to be searched!
  • File is organized sequentially.

8
More on Sequential Files
  • Sequential files are great for reporting,
    printing / displaying information, etc., but not
    for on-line retrieval of records!!
  • Insertions? Book does a poor job here.
  • While it is true that conceptually inserting a
    record into an existing file would require all
    records to be moved forward (assuming the
    blocks are full), in practice this is often NOT
    done because the file would not be available for
    online access during this physical update!
  • Sequential files are NOT USED for online
    processing.
  • Transactions may be batched up during the day and
    the file is typically updated one time and
    offline
  • (Note no one can access the file during
    updating) prior to generating end-of-day reports,
    exception reporting, summary reports, inventory
    reports, sales reports, statistical reports,
    listings, etc. if required...

9
Still More on Sequential Files
  • Need quick retrieval? Sequential Files wont cut
    it.
  • If we have a need for near immediate response
    time
  • we need a different kind of file organization
    one that provides for fast access too. (These go
    together)
  • ? Sequential Files are super for the right
    purposes.
  • Payroll, budgets, listings, rosters, accounts
    receivable, payables, inventories, hosts and
    hosts of items
  • but NOT any kind of on-line processing
    (retrieval, updates)
  • Can hold tremendous amounts of data.
  • Often used for backup as well as master files.

10
B-Trees and Indexing
  • Overall objective of B-Trees and similar indexing
    schemes is to support
  • Fast search
  • Fast insertion ? External Files
  • Fast deletion
  • Fast access

11
B-Trees
  • Need a different kind of tree than 2-3-4 trees.
  • 2-3-4 trees
  • Fine for in-memory operations, but volume and
    persistency needs limit the applicability to
    large files.
  • For large files we need more data items (records)
    per node so when we retrieve them from disk, we
    retrieve into RAM (and store to disk) more
    records / block.
  • Idea is to have few disk retrievals (much time)
    and then do sequential searches of retrieved
    blocks (nodes) in RAM for the desired record
    very quick.
  • (Recall 2-3-4 trees We retrieved a node and
    then searched it sequentially to determine / find
    the item we desired. But we were limited by size
    and availability)

12
B-Trees One Block per Node
  • B-Trees are designed to have many children /node.
  • Remember 2-3-4 trees are B-trees of order 4.
  • This structure will not support corporate file
    needs in the industrial sector at all!
  • B-Tree - much like the 2-3-4 tree but has many
    more nodes and many more links to children,
  • Keeps the tree height small ? few disk I/Os
  • (Nomenclature If a node has (book example) 17
    child pointers, the tree is said to be an order
    17 B-tree.)

13
B-Trees Searching
  • The number of levels in a B-Tree is relatively
    small and the number of records in a node is
    relatively large
  • Implies downward searching is fast (few levels)
    and
  • Searching node sequentially is quick once node
    is retrieved.
  • ? Typically a block (node) is randomly accessed
    based on some kind of key or index, read into
    memory, and searched sequentially
  • If record found, we are done.
  • If records exist on all levels, (recall 2-3-4
    trees), if the search for the desired record gets
    a high result, then the record is not in this
    node, and the access method retrieves the next
    block / node using another child pointer in the
    block.
  • Record found or leaf reached w/o finding record.
    Search top down.

14
B-Trees Insertions - 1
  • Want nodes reasonably full
  • Facilitates likelihood of finding correct record
    increased once we retrieve the node (block)
  • To Insert
  • A node split divides the data items equally
  • half go to a newly created node and half remain
    in old node.
  • Node splits performed bottom up, unlike 2-3-4
    trees where they were split going top down.

15
B-Trees Insertions 2
  • Book goes into a good example to show how a new
    record is inserted. Do go through this. Will
    see.
  • Approach is pretty simple, though.
  • Starting with a leaf node, we will fill it up
  • If we now add a 70 to this order-five B-Tree
  • This causes a node split, and the record numbers
    (only keys are shown) indicates that 60 is the
    middle number and is thus promoted upward, as we
    would expect.

20 40 60 80
16
B-Tree Insertion - 3 Example (Book)
So, we simply have the resultant
B-Tree. Pretty straightforward. Adding a
10 and 30 causes no problems They are added to
the lower left leaf node. Note records are
maintained in an ascending (ordered) order.
60
70 80
20 40
10 20 30 40
17
B-Tree Insertion -4 Example (Book)
  • But when we want to now add a 15 to this leaf
    (remember, we insert from the bottom up), we will
    have a node split again.
  • The record keys are 10, 15, 20, 30, and 40 such
    that 20 is the middle and is promoted up to its
    parent, and the arrangement now looks like

18
B-Trees Insertion -5 Algorithm continued
The 20 is the middle record the node is split
evenly with the middle record promoted up.
Records in the node (block) are kept sequential.
20 60
10 15
30 40
70 80
The process continues. Ultimately, the root
node above will also become full, and, using the
exact same algorithm, the tree grows by one level
upwards!
19
B-Trees Insertion -6 Algorithm
  • Clearly, larger nodes means fewer levels needed
    implies more records within a node,
  • This is both good and bad.
  • Larger blocks are retrieved from disk but more
    RAM (buffer space) is needed to hold the larger
    node in your process area.
  • But can sequentially search more items per node
  • ? Note too no node except the root will ever
    be less than half full.

20
B-Tree Efficiency
  • Very fast organization and access for retrievals
    / updates
  • Because there are so many records / node, the
    number of levels is relatively small.
  • If we have a B-Tree of order 9, the height of
    tree is somewhat less than log9n (log base 9 of
    n) that is 9 raised to what power equals n.
  • Number is going to be one more than the max
    height of the tree.
  • Example Let n 500,000 recs. 96 531,441.
    Tree six levels.
  • This means that as a max, only six accesses are
    necessary to find the correct node in the file of
    500,000 records.
  • Book at 10 msec /access, ? 60 msec to get the
    right block (node) (internal comparisons
    and searches are negligible re to disk accesses).
  • Cannot even compare this to those figures of
    retrieval of a record in a sequential file.

21
B-Tree Efficiency 2
  • Remember no free lunch.
  • Book hedges on size of the node and it is NOT
    free!
  • Biggest advantage of B-Trees is for adding and
    deleting records. This is very significant!
  • Of course we need to search to find the correct
    node into which we insert or from which we
    delete.
  • Remember too, the vast majority of the time, in a
    B-Tree a node will not have to be split!
  • Splitting the node requires timeand disk
    accesses.
  • Records can be accessed both randomly and
    sequentially.

22
Variations of B-Trees Implementation
  • Important to note that in some implementations of
    B-Trees, only the leaf nodes contain records
  • This is a B Tree. Extremely important.
  • In this scenario, non-leaf nodes only contain
    keys and block numbers (indices, if you will, to
    nodes)
  • Higher level nodes only contain keys / block
    numbers and thus contain many more keys /
    block!
  • This is what is done in VSAM! (not mentioned in
    your book).
  • VSAM is old but still prevails!!
  • (IBM Virtual Storage Access Method)

23
Supplementary Information
A B-tree is a tree data structure that keeps data
sorted and allows searches, sequential access,
insertions, and deletions in logarithmic
amortized time. The B-tree is a generalization
of a binary search tree in that a node can have
more than two children. Unlike self-balancing
binary search trees, the B-tree is optimized for
systems that read and write large blocks of data.
It is commonly used in databases and
filesystems. In B-trees, internal
(non-leaf) nodes can have a variable number of
child nodes within some pre-defined range. When
data is inserted or removed from a node, number
of child nodes changes. In order to maintain
the pre-defined range, internal nodes may be
joined/split. Because a range of child nodes is
permitted, B-trees do not need re-balancing as
frequently as other self-balancing search trees,
such as Red Black trees, but may waste some
space, since nodes are not entirely full
24
Supplementary Information
A B-tree is kept balanced by requiring that all
leaf nodes are at the same depth. This depth
will increase slowly as elements are added to the
tree, but an increase in the overall depth is
infrequent, and results in all leaf nodes being
one more node further away from the
root. B-trees have substantial advantages over
alternative implementations when node access
times far exceed access times within nodes. This
usually occurs when nodes are in secondary
storage like disk drives. By maximizing the
number of child nodes within each internal node,
the height of the tree decreases and the number
of expensive node accesses is reduced. -----------
----------------------- In addition, rebalancing
the tree occurs less often. The maximum number
of child nodes depends on the information that
must be stored for each child node and the size
of a full disk block or an analogous size in
secondary storage.
25
Supplementary Information B and B Trees
  • The term B-tree may refer to a specific design or
    it may refer to a general class of designs.
  • In the narrow sense, a B-tree stores keys in its
    internal nodes but need not store those keys
    again in the records at the leaves.
  • The general class includes variations such as the
    B-tree and the B-tree.
  • In the B-tree, copies of the keys are stored in
    the internal nodes the keys and records are
    stored in leaves in addition, This is IBMs VSAM
    KSDS. (Indexed Sequential File Organization)
  • Leaf node may include pointer to next leaf node
    to speed sequential access
  • -------------------------------
  • The B-tree balances more neighboring internal
    nodes to keep the internal nodes more densely
    packed.
  • For example, a non-root node of a B-tree must be
    only half full, but a non-root node of a B-tree
    must be two-thirds full.

26
Supplementary Information B and B Trees
A B tree or B plus tree is a type of tree which
represents sorted data in a way that allows for
efficient insertion, retrieval and removal of
records, each record of which is identified by a
key. It is a dynamic, multilevel index, with
maximum and minimum bounds on the number of keys
in each index segment (usually called a "block"
or "node"). ? In a B tree, in contrast to a
B-tree, all records are stored at the leaf level
of the tree only keys are stored in interior
nodes. The primary value of a B tree is in
storing data for efficient retrieval in a
block-oriented storage contextin particular,
filesystems. This is primarily because unlike
binary search trees, B trees have very high
fanout (typically on the order of 100 or more),
which reduces the number of I/O operations to the
next level down - required to find an element in
the tree.
27
Supplementary Information B and B Trees
The leaves (the bottom-most index blocks) of the
B tree are often linked to one another in a
linked list this makes range queries or an
(ordered) iteration through the blocks simpler
and more efficieny. This does not substantially
increase space consumption or maintenance on the
tree. This illustrates one of the significant
advantages of a B-tree over a B-tree in a
B-tree, since not all keys are present in the
leaves, such an ordered linked list cannot be
constructed. A B-tree is thus particularly
useful as a database system index, where the data
typically resides on disk, as it allows the
B-tree to actually provide an efficient
structure for housing the data itself. -----------
------- If nodes of the B tree are organized as
arrays of elements, then it may take a
considerable time to insert or delete an element
as half of the array will need to be shifted on
average, since the data is organized
sequentially. To overcome this problem, elements
inside a node can be organized in a binary tree
or a B tree instead of an array.
28
Different Kinds of Trees
  • Binary trees Binary search tree (BST)  Van
    Emde Boas tree  Cartesian tree  Top Tree 
    T-tree
  • Self-balancing binary search trees Red-black
    tree  AVL tree  AA tree  Splay tree 
    Scapegoat tree  Treap
  • B-trees B tree  B-tree  UB-tree  2-3
    tree  2-3-4 tree  (a,b)-tree  Dancing tree 
    Htree  Bx-tree
  • Tries Suffix tree  Radix tree  Ternary
    search tree
  • Binary space partitioning (BSP) trees
    Quadtree  Octree  kd-tree (implicit)  VP-tree
  • Non-binary trees Exponential tree  Fusion
    tree  Interval tree  PQ tree  Range tree 
    SPQR tree
  • Spatial data partitioning trees R-tree  R
    tree  R tree  X-tree  M-tree  Segment tree 
    Fenwick tree  Hilbert R-tree
  • Other trees Heap  Hash tree  Finger tree 
    Metric tree  Cover tree  BK-tree 
    Doubly-chained tree 

29
Databases
  • A Database is a collection of data organized
    in a fashion that facilitates updating,
    retrieving, and managing the data.
  • Can consist of anything, including, but not
    limited to names, addresses, pictures, and
    numbers. Databases are commonplace and are used
    everyday.
  • Very common for a database to contain millions of
    records requiring many gigabytes of storage.
  • Because databases cannot typically be maintained
    entirely in memory, B-trees are often used to
    index data and provide fast access.
  • Searching an unindexed and unsorted database
    containing n key values will have a worst case
    running time of O(n)
  • If same data is indexed with a B-tree, same
    search operation will run in O(log n).

30
Indexed Sequential File Organization
  • This type of file arrangement is called indexed
    sequential because it contains an index file for
    random access (searching, inserting, deleting,
    etc) as well as an ordering of indices with
    pointers to support sequential access, for
    reports and a host of standard business
    operations.
  • In VSAM (Virtual Storage Access Method) there are
    levels of indices, that is indexes for the
    indexes (called index sets / sequence sets)
  • Indexed Sequential Files in VSAM are called Key
    Sequenced Data Sets and the structure of the
    files are B Trees (that is, data records are
    at the leafs indexes/keys are in the upper
    nodes sequence sets and, if needed, index sets)
  • Data records are stored in what IBM calls Control
    Areas and Control Intervals.
  • In IBM parlance, a node split is a control
    interval split. Same idea!
  • Records are inserted bottom up as in a B-Tree.

31
Indexed File
  • The index consists of small records of keys and
    pointers
  • Blocks of Records (control intervals) are indexed
    by the keys
  • An index may point to the highest key within a
    control interval.
  • Pointers point to a control interval (block)
    where the record with a desired value (key) may
    be located.
  • Theoretically, a random access requires two
    accesses at least one for the index and one for
    the node of records!
  • In truth, the indexed file is usually brought
    into memory during these operations and
    maintained in a cache store to facilitate fast
    retrieval.
  • Here is what the VSAM organization looks like for
    Key Sequenced Data Sets (KSDSs)

32
INDEX COMPONENT
Pointers have values and block numbers


INDEX SET
SEQUENCE SET
. . .
DATA COMPONENT
CONTROL INTERVALS
. . .
CONTROL AREA
CONTROL AREA
CONTROL AREA
33
Key values plus pointers to blocks Note the
points to support sequential operations.
KEY VALUES EXTREMELY EXAGGERATED!!
I1
I2
62 S2
INDEX SET
FREE
9/S1
SEQUENCE SETS
S1
S2
S3
3 D1
9 D2
36 D3
62 D4
FREE
FREE
D1
D2
D3
D4
36
FREE
FREE
FREE
FREE
1
3
5
9
35
42
43
62
CONTROL INTERVALS
CONTROL INTERVALS
CONTROL AREA
34
Index Sequential more - VSAM!!
  • ? So, an indexed sequential is organized
    Indexed Sequentially and can be accessed
    both sequentially and randomly.
  • Organized by a record key (unique)
  • Random Access of Records Records can be
    accessed randomly (via the indices (record key))
    and sequentially via the nodes and pointers.
  • Sequential Access Only the leaf nodes contain
    records, but in turn, they point to the next node
    that contains the next group of sequential
    records in order to support sequential
    processing.
  • Random Access does not need those pointers to
    the next node search proceeds as previously
    described using index sets and sequence sets and
    pointers to control intervals.

35
Directories
  • OK. Nice, but wheres the catch.
  • Indexed Sequential File, there is No 1-1
    correspondence between indices and actual
    records. IS a correspondence between indices and
    nodes / control intervals in IBM.
  • Sometimes we need a directory.
  • A one-to-one correspondence between an index
    record and a data record, we call the index file
    a directory.
  • (Think telephone directory)
  • Directories are VERY fast, but incur a lot of
    overhead to maintain this 1-1 correspondence
    between indices and data records...
  • May be perfect for real time applications!!

36
  • In practice (not in book), most index files are
    not organized as
  • Rather, they contain the largest key of a control
    interval (node).
  • Then, in looking for a record, we search lt index
    record which will point us to a node where the
    desired record may/may not be.
  • This reduces maintenance on the index file, which
    would slow down insertions / deletions
    dramatically.

37
Indexed File Alternate Keys
  • Now, in indexed sequential operations, we may
    have indexes of more than one key.
  • Perhaps we wish to have random access based on
    ssan (record key?) as well as, perhaps, a
    secondary key, such as an account number
    (unique), and, perhaps we also need to be able to
    retrieve on last name (non-unique secondary key).
  • We can do it all with additional indices!
  • However, each index must be maintained and
    updated whenever we add / delete records.
  • Thus significant overhead. But if it is needed,
    it is needed.
  • Thus there are decisions practitioners must make
    Yes, we can support a clients being able to
    retrieve on any number of keys, but the price may
    be high, and the complexity may also be high.
Write a Comment
User Comments (0)
About PowerShow.com