1
Architectural and Design Issues in the General
Parallel File System
May 2005
Benny Mandler - mandler@il.ibm.com
2
Agenda
  • What is GPFS?
  • A file system for high performance computing
  • General architecture
  • How does GPFS meet its challenges - architectural
    issues
  • Performance
  • Scalability
  • High availability
  • Concurrency control

3
What is Parallel I/O?
  • Multiple processes (possibly on multiple nodes)
    participate in the I/O
  • Application level parallelism
  • File is stored on multiple disks on a parallel
    file system
  • Additional Interfaces for I/O (can impact
    portability)

4
What is Parallel I/O? (Cont.)
  • Parallel I/O should safely support
  • Application-level I/O parallelism across multiple computational nodes
  • Physical parallelism over multiple disks and servers
  • Parallelism in file system overhead operations
  • A parallel file system must support
  • Parallel I/O
  • A consistent global name space across all nodes of the cluster, including a consistent view of the same file from all nodes
  • A programming model allowing programs to access file data distributed over multiple nodes, from multiple tasks running on multiple nodes (see the sketch below)
  • Physical distribution of data across disks and network entities, which eliminates bottlenecks both at the disk interface and at the network, providing more effective bandwidth to the I/O resources
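To make this access pattern concrete, here is a minimal sketch (plain POSIX, not GPFS-specific): each task writes its own disjoint byte range of one shared file. The file path, block size, and the TASK_RANK environment variable are illustrative assumptions; a real job would obtain its rank from the launcher (e.g., MPI).

```c
/* Minimal sketch: each task writes a disjoint byte range of one shared
 * file -- the access pattern a parallel file system must support safely.
 * Rank would normally come from a job launcher; here it is read from an
 * (assumed) TASK_RANK environment variable. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    long rank   = atol(getenv("TASK_RANK") ? getenv("TASK_RANK") : "0");
    long blocks = 4;                  /* blocks written per task        */
    size_t bsz  = 256 * 1024;         /* 256 KB, the GPFS default block */

    int fd = open("/gpfs/shared.dat", O_WRONLY | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    char *buf = malloc(bsz);
    memset(buf, 'A' + (rank % 26), bsz);

    /* Each task owns a contiguous region; ranges never overlap, so no
     * application-level coordination is needed beyond the file system. */
    for (long b = 0; b < blocks; b++) {
        off_t off = (off_t)rank * blocks * bsz + (off_t)b * bsz;
        if (pwrite(fd, buf, bsz, off) != (ssize_t)bsz) {
            perror("pwrite");
            return 1;
        }
    }
    free(buf);
    close(fd);
    return 0;
}
```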

5
GPFS vs. local and distributed file systems on
the SP
  • Native AIX File System (JFS)
  • No file sharing - application can only access
    files on its own node
  • Applications must do their own data partitioning
  • DCE Distributed File System
  • Application nodes (DCE clients) share files on
    server node
  • Switch is used as a fast LAN
  • Coarse-grained (file or segment level)
    parallelism
  • Server node is performance and capacity bottleneck
  • GPFS Parallel File System
  • GPFS file systems are striped across multiple
    disks on multiple storage nodes
  • Independent GPFS instances run on each
    application node
  • GPFS instances use storage nodes as "block
    servers" - all instances can access all disks

6
Scalable Parallel Computing
  • RS/6000 SP Scalable Parallel Computer
  • Hundreds of nodes connected by high-speed switch
  • N-Way SMP
  • > 1 TB disk per node
  • Hundreds of MB/s full duplex per switch port
  • Scalable parallel computing enables I/O-intensive
    applications
  • Deep computing - simulation, seismic analysis,
    data mining
  • Server consolidation - aggregating file and web servers on a centrally-managed machine
  • Streaming video and audio for multimedia
    presentation
  • Scalable object store for large digital
    libraries, web servers, databases, ...

7
GPFS History
  • Shark video server
  • Video streaming from single RS/6000
  • Complete system: included file system, network driver, and control server
  • Large data blocks, admission control, deadline
    scheduling
  • Bell Atlantic video-on-demand trial (1993-94)
  • Tiger Shark multimedia file system
  • Multimedia file system for RS/6000 SP
  • Data striped across multiple disks, accessible
    from all nodes
  • Hong Kong and Tokyo video trials, Austin video
    server products
  • GPFS parallel file system
  • General purpose file system for commercial and
    technical computing on RS/6000 SP, AIX and Linux
    clusters.
  • Recovery, online system management, byte-range
    locking, fast prefetch, parallel allocation,
    scalable directory, small-block random access,
    ...
  • Released as a product: 1.1 - 05/98, 1.2 - 12/98, 1.3 - 04/00, ...

8
What is GPFS? IBM's shared-disk, parallel file system for AIX and Linux clusters
  • Cluster: 512 nodes today, fast reliable communication, common admin domain
  • Shared disk: all data and metadata on disks accessible from any node through the disk I/O interface (i.e., "any to any" connectivity)
  • Parallel: data and metadata flow from all of the nodes to all of the disks in parallel
  • RAS: reliability, availability, serviceability

9
GPFS addresses SP I/O requirements: High Performance - multiple GB/s to/from a single file
  • Concurrent reads and writes, parallel data access
    - within a file and across files
  • Byte-range locking
  • Support fully parallel access both to file data
    and metadata
  • Client caching enabled by distributed locking
  • Wide striping
  • Large data blocks
  • Prefetch, write-behind
  • Access pattern optimizations
  • Distributed management functions
  • Multi-pathing

10
GPFS addresses SP I/O requirements (Cont.)
  • Scalability in many respects
  • Scales up to 512 nodes (N-Way SMP)
  • Storage nodes
  • File system nodes
  • Disks: 100s of TB
  • Adapters
  • High Availability
  • Fault-tolerance via logging, replication, RAID
    support
  • Survives node and disk failures
  • Uniform access via shared disks - single-image file system
  • High capacity: multiple TB per file system, 100s of GB per file
  • Standards compliant (X/Open 4.0 "POSIX") with
    minor exceptions and extensions

11
GPFS comes in different flavors
Storage Area Network
  • Advantages
  • Separate storage I/O service and application jobs
  • Well suited to synchronous applications
  • Can utilize extra switch bandwidth
  • Disadvantages
  • Performance gated by adapters in the servers
  • Advantages
  • Performance scales with the number of servers
  • Uses unused compute cycles if available
  • Can utilize extra switch bandwidth
  • Disadvantages
  • Cycle stealing from the compute nodes
  • Advantages
  • Simpler I/O model: a storage I/O operation does not require an associated network I/O operation
  • Can be used instead of a switch when a switch is
    not otherwise needed
  • Disadvantages
  • Cost/complexity when building large SANs

12
Agenda
  • What is GPFS?
  • A file system for high performance computing
  • General architecture
  • How does GPFS meet its challenges - architectural
    issues
  • Performance
  • Scalability
  • High availability
  • Concurrency control

13
Shared Disks - Virtual Shared Disk architecture
  • File systems consist of one or more shared disks
  • An individual disk can contain data, metadata, or both
  • Disks are assigned to failure groups
  • Data and metadata are striped to balance load and
    maximize parallelism
  • Recoverable Virtual Shared Disk for accessing
    disk storage
  • Disks are physically attached to SP nodes
  • VSD allows access to disks over the SP switch
  • VSD client looks like disk device driver on
    client node
  • VSD server executes I/O requests on storage
    node.
  • VSD supports JBOD or RAID volumes, fencing,
    multi-pathing (where physical hardware permits)
  • GPFS itself assumes only a conventional block I/O interface (a rough sketch follows)
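As a rough illustration of what "a conventional block I/O interface" means here, the sketch below defines a minimal block-device abstraction plus a toy RAM-backed implementation. The struct layout and function names are assumptions for illustration, not the actual VSD API.

```c
/* Illustrative block-device abstraction: GPFS (via VSD) needs little more
 * than "read/write block N on disk D". A VSD client would ship these calls
 * over the SP switch to the storage node owning the disk; this toy version
 * is backed by memory. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BLKSZ 512

struct block_dev {
    uint64_t nblocks;
    int  (*read_block)(struct block_dev *d, uint64_t blk, void *buf);
    int  (*write_block)(struct block_dev *d, uint64_t blk, const void *buf);
    void *priv;               /* driver- or VSD-specific state */
};

static char ramdisk[128][BLKSZ];   /* toy in-memory "disk" */

static int ram_read(struct block_dev *d, uint64_t blk, void *buf)
{
    (void)d; memcpy(buf, ramdisk[blk], BLKSZ); return 0;
}
static int ram_write(struct block_dev *d, uint64_t blk, const void *buf)
{
    (void)d; memcpy(ramdisk[blk], buf, BLKSZ); return 0;
}

int main(void)
{
    struct block_dev dev = { 128, ram_read, ram_write, NULL };
    char buf[BLKSZ] = "superblock";
    dev.write_block(&dev, 0, buf);   /* all GPFS data and metadata flow   */
    char out[BLKSZ];                 /* through calls of this shape       */
    dev.read_block(&dev, 0, out);
    printf("block 0: %s\n", out);
    return 0;
}
```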

14
GPFS Architecture Overview
  • Implications of Shared Disk Model
  • All data and metadata on globally accessible
    disks (VSD)
  • All access to permanent data through disk I/O
    interface
  • Distributed protocols, e.g., distributed locking,
    coordinate disk access from multiple nodes
  • Fine-grained locking allows parallel access by
    multiple clients
  • Logging and Shadowing restore consistency after
    node failures
  • Implications of Large Scale
  • Support up to 4096 disks of up to 1 TB each (4
    Petabytes)
  • The largest system in production is 75 TB
  • Failure detection and recovery protocols to
    handle node failures
  • Replication and/or RAID protect against disk /
    storage node failure
  • On-line dynamic reconfiguration (add, delete, or replace disks and nodes; rebalance the file system)

15
GPFS Architecture - Special Node Roles
  • Three types of nodes: file system, storage, and manager
  • File system nodes
  • Run user programs, read/write data to/from
    storage nodes
  • Implement virtual file system interface
  • Cooperate with manager nodes to perform metadata
    operations
  • Manager nodes
  • Global lock manager
  • File system configuration: recovery, adding disks, ...
  • Disk space allocation manager
  • Quota manager
  • File metadata manager - maintains file metadata
    integrity
  • Storage nodes
  • Implement block I/O interface
  • Shared access to file system and manager nodes
  • Interact with manager nodes for recovery (e.g.
    fencing)
  • Data and metadata striped across multiple disks -
    multiple storage nodes

16
GPFS Software Structure
17
Disk Data Structures: Files
  • Large block size allows efficient use of disk
    bandwidth
  • Fragments reduce space overhead for small files
  • No designated "mirror", no fixed placement
    function
  • Flexible replication (e.g., replicate only
    metadata, or only important files)
  • Dynamic reconfiguration: data can migrate block-by-block
  • Multi-level indirect blocks
  • Each disk address: a list of pointers to replicas
  • Each pointer: disk id + sector no. (illustrated below)
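The C structs below are an illustrative rendering of the addressing scheme described on this slide (disk address = list of replica pointers; pointer = disk id + sector number; multi-level indirect blocks). Field names and sizes are assumptions, not the GPFS on-disk format.

```c
/* Illustrative only -- field names and sizes are assumptions, not the
 * actual GPFS on-disk layout. It mirrors the slide: a disk address is a
 * list of replica pointers, each pointer = disk id + sector number. */
#include <stdint.h>
#include <stdio.h>

#define MAX_REPLICAS 2                  /* e.g. data plus one replica    */

struct disk_ptr {
    uint16_t disk_id;                   /* which disk in the file system */
    uint64_t sector;                    /* sector number on that disk    */
};

struct disk_addr {                      /* one logical file block        */
    struct disk_ptr replica[MAX_REPLICAS];
    uint8_t  nreplicas;
};

struct inode_sketch {
    uint64_t size;
    uint8_t  missing_update;            /* set while a replica is stale  */
    struct disk_addr direct[16];        /* small files                   */
    struct disk_addr indirect;          /* points to a block of addrs    */
    struct disk_addr double_indirect;   /* multi-level indirection       */
};

int main(void)
{
    printf("inode sketch occupies %zu bytes\n", sizeof(struct inode_sketch));
    return 0;
}
```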

18
Agenda
  • What is GPFS?
  • A file system for High Performance Computing
  • General architecture
  • How does GPFS meet its challenges - architectural
    issues
  • Performance
  • Scalability
  • High availability
  • Concurrency control

19
Large File Block Size
  • Conventional file systems store data in small
    blocks to pack data more densely
  • GPFS uses large blocks (256 KB default) to optimize disk transfer speed (see the rough calculation below)
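A back-of-the-envelope calculation shows why large blocks help: with a fixed positioning cost per I/O, effective bandwidth approaches the media rate only when the block is large. The seek time and transfer rate below are illustrative assumptions, not measured GPFS numbers.

```c
/* Back-of-the-envelope illustration (numbers are assumptions, not
 * measurements): effective bandwidth = block / (seek + block / rate). */
#include <stdio.h>

int main(void)
{
    double seek = 0.010;              /* 10 ms average positioning time */
    double rate = 10e6;               /* 10 MB/s media transfer rate    */
    double blocks[] = { 4e3, 64e3, 256e3 };

    for (int i = 0; i < 3; i++) {
        double b = blocks[i];
        double eff = b / (seek + b / rate);
        printf("block %6.0f KB -> %5.2f MB/s effective\n",
               b / 1e3, eff / 1e6);
    }
    return 0;   /* 4 KB gives ~0.4 MB/s, 256 KB ~7.2 MB/s: large blocks win */
}
```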

20
Parallelism and consistency
  • Distributed locking - acquire the appropriate lock for every operation - used for updates to user data
  • Centralized management - conflicting operations forwarded to a designated node - used for file metadata
  • Distributed locking + centralized hints - used for space allocation
  • Central coordinator - used for configuration changes

I/O slowdown effects: additional I/O activity rather than token server overload
21
Parallel File Access From Multiple Nodes
  • GPFS allows parallel applications on multiple
    nodes to access non-overlapping ranges of a
    single file with no conflict
  • Global locking serializes access to overlapping
    ranges of a file
  • Global locking based on "tokens" which convey
    access rights to an object (e.g. a file) or
    subset of an object (e.g. a byte range)
  • Tokens can be held across file system operations,
    enabling coherent data caching in clients
  • Cached data discarded or written to disk when
    token is revoked
  • Performance optimizations: required/desired ranges, metanode, data shipping, special token modes for file-size operations (a simplified sketch follows)
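The sketch below illustrates the byte-range token idea in simplified form: a request covered by a locally cached token is granted with no message, otherwise the token manager is asked (and may first revoke conflicting tokens held elsewhere). The data structures and the local stub standing in for the token manager are assumptions; real GPFS tokens also carry lock modes and required/desired ranges.

```c
/* Conceptual sketch of byte-range token logic (simplified). The "token
 * manager" here is a local stub standing in for the RPC a real client
 * would issue. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct brange { uint64_t start, end; };            /* [start, end)       */
struct token  { struct brange r; bool write; bool held; };

static bool covers(const struct token *t, struct brange w, bool write)
{
    return t->held && w.start >= t->r.start && w.end <= t->r.end &&
           (t->write || !write);                   /* write needs a write token */
}

/* Stub: in reality a message to the token manager node, which may first
 * revoke ("steal") conflicting tokens held by other nodes, forcing their
 * dirty cached data to disk. */
static bool token_manager_grant(struct token *t, struct brange w, bool write)
{
    t->r = w; t->write = write; t->held = true;
    return true;
}

static bool acquire_range(struct token *local, struct brange w, bool write)
{
    if (covers(local, w, write))
        return true;                               /* granted locally, no message */
    return token_manager_grant(local, w, write);
}

int main(void)
{
    struct token t = { { 0, 0 }, false, false };
    struct brange a = { 0, 1 << 20 };              /* first 1 MB          */
    printf("first write acquire: %s\n",
           acquire_range(&t, a, true) ? "granted" : "denied");

    struct brange b = { 4096, 8192 };              /* cached sub-range, read */
    if (covers(&t, b, false))
        puts("cached read acquire: granted locally, no message sent");
    return 0;
}
```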

22
I/O throughput scaling - nodes and disks
  • 32-node SP, 480 disks, 2 I/O servers
  • Single file - n large contiguous sections
  • Writes - update in place

23
Deep Prefetch for High Throughput
  • GPFS stripes successive blocks across successive
    disks
  • Disk I/O for sequential reads and writes is done
    in parallel
  • GPFS measures application "think time", disk throughput, and cache state to automatically determine the optimal parallelism (one possible sizing rule is sketched below)
  • Prefetch algorithms now recognize strided and reverse-sequential access
  • Accepts hints
  • Write-behind policy
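The slide does not give the actual GPFS heuristic, so the snippet below shows only one plausible way a prefetch depth could be sized from measured per-block think time and disk service time; the numbers and the formula itself are illustrative assumptions.

```c
/* One plausible way to size a prefetch pipeline from the quantities the
 * slide says GPFS measures (application think time, disk throughput);
 * this is NOT the actual GPFS algorithm. */
#include <stdio.h>

int main(void)
{
    double block      = 256e3;    /* bytes per block                     */
    double disk_time  = 0.030;    /* seconds to fetch one block          */
    double think_time = 0.005;    /* app processing time per block       */

    /* To keep the application from waiting, enough blocks must be in
     * flight to hide the disk latency behind the app's think time. */
    int depth = (int)(disk_time / think_time) + 1;
    double demand = block / think_time;           /* bytes/s consumed    */

    printf("consumption %.1f MB/s -> prefetch %d blocks ahead\n",
           demand / 1e6, depth);
    return 0;
}
```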

24
GPFS Throughput Scaling for Non-cached Files
  • Hardware: Power2 wide nodes, SSA disks
  • Experiment: sequential read/write from a large number of GPFS nodes to a varying number of storage nodes
  • Result: throughput increases nearly linearly with the number of storage nodes
  • Bottlenecks
  • Microchannel limits node throughput to 50 MB/s
  • System throughput limited by available storage
    nodes

25
Disk Data Structures: Allocation Map
  • Each segment contains bits representing blocks on all disks
  • Each segment is a separately lockable unit
  • Minimizes contention for the allocation map when writing files from multiple nodes
  • The allocation manager service provides hints about which segments to try (a sketch follows)
  • The inode allocation map looks similar
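A minimal sketch of a segmented allocation map, assuming an in-memory layout with pthread mutexes standing in for distributed locks: each segment covers blocks on every disk and is locked as a unit, and allocation starts at a hinted segment so nodes writing different files rarely collide. This is not the GPFS on-disk format.

```c
/* Sketch of a segmented block allocation map (layout illustrative). Each
 * segment covers some blocks on *every* disk and is locked as a unit, so
 * nodes writing different files allocate from different segments without
 * contending for the map. */
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NDISKS     8
#define NSEGMENTS  128        /* 64 blocks per disk per segment (one word) */

struct alloc_segment {
    pthread_mutex_t lock;     /* stands in for a distributed lock/token    */
    uint64_t bits[NDISKS];    /* one bit per block on each disk            */
};

static struct alloc_segment map[NSEGMENTS];

/* Allocate one block, starting from the segment the allocation manager
 * hinted at, striping across disks as free bits are found. */
static bool alloc_block(int hint, int *seg, int *disk, int *blk)
{
    for (int s = 0; s < NSEGMENTS; s++) {
        int sg = (hint + s) % NSEGMENTS;
        pthread_mutex_lock(&map[sg].lock);
        for (int d = 0; d < NDISKS; d++)
            for (int b = 0; b < 64; b++)
                if (!(map[sg].bits[d] & (1ULL << b))) {
                    map[sg].bits[d] |= 1ULL << b;
                    pthread_mutex_unlock(&map[sg].lock);
                    *seg = sg; *disk = d; *blk = b;
                    return true;
                }
        pthread_mutex_unlock(&map[sg].lock);
    }
    return false;             /* no free blocks anywhere */
}

int main(void)
{
    for (int s = 0; s < NSEGMENTS; s++)
        pthread_mutex_init(&map[s].lock, NULL);

    int sg, d, b;
    if (alloc_block(42, &sg, &d, &b))     /* 42 = hypothetical hint */
        printf("allocated block %d on disk %d (segment %d)\n", b, d, sg);
    return 0;
}
```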

26
Allocation Manager
[Diagram: the allocation manager server tracks per-segment free-space counts; client nodes send "update free" messages for the segments they allocate from or free into]
  • Deleted files' blocks are function-shipped to the current segment owners
27
Allocation manager and metanode evaluation
  • Write-in-place vs. new file creation
  • Create throughput scales nearly linearly with the number of nodes
  • Creating a single file from multiple nodes is as fast as each node creating a different file

28
HSM - Space Management
[Diagram: GPFS migrates data to, and recalls data from, an ADSM server and its database]
  • Migrates inactive data
  • Transparent recall
  • Cost / disk-full reduction
  • Policy managed
  • Integrated with backup
29
Data Management API for GPFS
  • Fully XDSM standard compliant
  • Innovative enhancements entailed by multi-node
    model
  • Intended for HSM applications such as HPSS,
    ADSM, etc.
  • Principles of operation
  • Backend: GPFS file operations generate events that are monitored by a data management application
  • Front-end: the data management application initiates invisible migration of file data between GPFS and HSM
  • High throughput using multiple sessions and
    parallel movers
  • Resilient to failures, and provides transparent
    recovery

30
High Availability - Logging and Recovery
  • Problem: detect and fix file system inconsistencies after a failure of one or more nodes
  • All updates that may leave inconsistencies if uncompleted are logged
  • Write-ahead logging policy: the log record is forced to disk before the dirty metadata is written (a minimal sketch follows this list)
  • Redo log: replaying all log records at recovery time restores file system consistency
  • Logged updates
  • I/O to replicated data
  • Directory operations (create, delete, move, ...)
  • Allocation map changes
  • Other techniques
  • Ordered writes
  • Shadowing
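A minimal write-ahead-logging sketch, assuming simple file-backed storage and an illustrative record format: the redo record is forced to the log before the dirty metadata block is written in place, so replaying the log after a crash restores consistency.

```c
/* Minimal write-ahead-logging sketch (record format and file names are
 * assumptions, not GPFS's). The invariant shown is the one on the slide:
 * the log record reaches disk before the dirty metadata it describes. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

struct log_rec {
    long txid;
    char op[16];                       /* e.g. "dir_insert"            */
    long block;                        /* metadata block being updated */
};

static int log_fd, meta_fd;

static int log_update(const struct log_rec *r,
                      const void *meta, size_t n, off_t off)
{
    /* 1. force the redo record to the log ...                          */
    if (write(log_fd, r, sizeof *r) != (ssize_t)sizeof *r) return -1;
    if (fsync(log_fd) != 0) return -1;
    /* 2. ... only then write the dirty metadata block in place.        */
    if (pwrite(meta_fd, meta, n, off) != (ssize_t)n) return -1;
    return 0;
}

int main(void)
{
    log_fd  = open("journal.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    meta_fd = open("metadata.img", O_RDWR | O_CREAT, 0644);
    if (log_fd < 0 || meta_fd < 0) { perror("open"); return 1; }

    char dirblock[512];
    memset(dirblock, 0, sizeof dirblock);
    struct log_rec r = { .txid = 1, .op = "dir_insert", .block = 7 };

    if (log_update(&r, dirblock, sizeof dirblock, 7 * 512) == 0)
        puts("update logged then applied; replaying the log after a crash "
             "re-applies (redoes) it and restores consistency");
    return 0;
}
```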

31
Node Failure Recovery
  • Application node failure
  • Force-on-steal policy ensures that all changes
    visible to other nodes have been written to disk
    and will not be lost
  • All potential inconsistencies are protected by a
    token and are logged
  • File system manager runs log recovery on behalf
    of the failed node
  • After log recovery tokens held by the failed node
    are released
  • Actions taken: restore metadata being updated by the failed node to a consistent state and release resources held by the failed node
  • File system manager failure
  • New node is appointed to take over
  • New file system manager restores volatile state
    by querying other nodes
  • New file system manager may have to undo or
    finish a partially completed configuration change
    (e.g., add/delete disk)
  • Storage node failure
  • Dual-attached disk: use the alternate path (VSD)
  • Single-attached disk: treat as a disk failure

32
Handling Disk Failures
  • When a disk failure is detected
  • The node that detects the failure informs the
    file system manager
  • File system manager updates the configuration
    data to mark the failed disk as "down" (quorum
    algorithm)
  • While a disk is down (see the sketch below)
  • Read one / write all available copies
  • "Missing update" bit set in the inode of modified files
  • When/if the disk recovers
  • File system manager searches the inode file for missing-update bits
  • All data and metadata of files with missing updates are copied back to the recovering disk (one file at a time, normal locking protocol)
  • Until missing update recovery is complete, data
    on the recovering disk is treated as write-only
  • Unrecoverable disk failure
  • Failed disk is deleted from configuration or
    replaced by a new one
  • New replicas are created on the replacement or on
    other disks
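The sketch below illustrates "read one / write all available copies" with a missing-update flag, using a toy in-memory array as the "disks"; the replica layout and helper names are assumptions for illustration.

```c
/* Sketch of replicated I/O while one disk is down: read any one available
 * copy, write all *available* copies, and flag the file so the stale
 * replica is rebuilt when the disk returns. Layout is illustrative. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define NREPLICAS 2
#define BLOCK     4096

struct replica { int disk_id; uint64_t sector; };

static bool disk_up[16] = { true, false, true, true };   /* disk 1 is down */
static char disks[16][8][BLOCK];                         /* toy "storage"  */

static bool read_block(struct replica r[NREPLICAS], char *buf)
{
    for (int i = 0; i < NREPLICAS; i++)                  /* read one copy  */
        if (disk_up[r[i].disk_id]) {
            memcpy(buf, disks[r[i].disk_id][r[i].sector], BLOCK);
            return true;
        }
    return false;                                        /* all copies down */
}

static bool write_block(struct replica r[NREPLICAS], const char *buf,
                        bool *missing_update)
{
    bool wrote = false;
    for (int i = 0; i < NREPLICAS; i++) {
        if (disk_up[r[i].disk_id]) {
            memcpy(disks[r[i].disk_id][r[i].sector], buf, BLOCK);
            wrote = true;
        } else {
            *missing_update = true;      /* remember to re-replicate later */
        }
    }
    return wrote;
}

int main(void)
{
    struct replica r[NREPLICAS] = { { 1, 3 }, { 2, 5 } };  /* one copy on the down disk */
    char buf[BLOCK] = "hello";
    bool missing = false;

    write_block(r, buf, &missing);       /* writes only the available copy */
    char out[BLOCK];
    read_block(r, out);                  /* reads whichever copy is up     */
    printf("read back '%s', missing-update bit: %d\n", out, missing);
    return 0;
}
```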

33
Concurrency Control: High-level Metadata
  • Managed by central coordinators
  • Configuration manager
  • Elected through Group Services
  • Longest surviving node
  • Appoints a manager for each GPFS file system as
    it is mounted
  • File system manager
  • Handles all changes to file system configuration,
    e.g.,
  • Adding/deleting disks (including alloc map
    initialization)
  • The only node that reads and writes configuration data (superblock)
  • Initiates and coordinates data migration
    (rebalance)
  • Creates and assigns log files
  • Token manager, etc.
  • Appointed by the configuration manager
  • Token manager coordinates distributed locking
    (next slide)
  • Others: quota manager, allocation manager, ACLs, extended attributes

34
Concurrency Control: Fine-grain (Meta)data
  • Token based distributed lock manager
  • First lock request for an object requires a
    message to the token manager to obtain a token
  • Subsequent lock requests can be granted locally
  • Data can be cached as long as a token is held
  • When a conflicting lock request is issued from
    another node the token is revoked ("token steal")
  • Force-on-steal policy: modified data are written to disk when the token is revoked (sketched below)
  • Whole-file locking for less frequent operations (e.g., create, truncate, ...); finer-grain locking for read/write
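A client-side sketch of the force-on-steal behaviour, assuming an illustrative cache and token representation: when a token is revoked, dirty blocks in the stolen range are flushed to disk before the token is surrendered.

```c
/* Client-side sketch of a token revocation ("steal") callback: before the
 * token is given up, any dirty cached data it protects is forced to disk
 * so the next holder sees all changes. Cache details are illustrative. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct cached_block {
    uint64_t offset;
    bool     dirty;
    /* ... data buffer, inode reference, etc. ... */
};

struct byte_range_token {
    uint64_t start, end;
    bool     held;
};

/* Stand-in for writing one dirty block back through the disk I/O layer. */
static void flush_block(struct cached_block *b)
{
    printf("flushing dirty block at offset %llu\n",
           (unsigned long long)b->offset);
    b->dirty = false;
}

/* Called when the token manager asks this node to give up a token because
 * another node requested a conflicting range. */
static void on_token_steal(struct byte_range_token *t,
                           struct cached_block *cache, int ncached)
{
    for (int i = 0; i < ncached; i++)
        if (cache[i].dirty &&
            cache[i].offset >= t->start && cache[i].offset < t->end)
            flush_block(&cache[i]);     /* force on steal                */
    t->held = false;                    /* cached data no longer covered */
}

int main(void)
{
    struct cached_block cache[] = {
        { 0, true }, { 262144, true }, { 524288, false }
    };
    struct byte_range_token tok = { 0, 1 << 20, true };
    on_token_steal(&tok, cache, 3);     /* simulate revoking the first 1 MB */
    return 0;
}
```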

35
Parallel System Administration
  • Data redistribution
  • Disk addition/deletion/replacement
  • Replication/striping due to disk failures

36
Cache Management
  • Balance dynamically according to usage patterns
  • Avoid fragmentation - internal and external
  • Unified steal
  • Periodic re-balancing

37
Epilogue
  • Used on six of the ten most powerful supercomputers in the world, including the largest (ASCI White)
  • Installed at several hundred customer sites, on
    clusters ranging from a few nodes with less than
    a TB of disk, up to 512 nodes with 140 TB of disk
    in 2 file systems
  • IP rich - 20 filed patents
  • State of the art
  • TeraSort
  • World record of 17 minutes
  • Using a 488-node SP: 432 file system nodes and 56 storage nodes (604e, 332 MHz)
  • Total of 6 TB of disk space
  • References
  • GPFS home page: http://www.haifa.il.ibm.com/projects/storage/gpfs.html
  • FAST 2002: http://www.usenix.org/publications/library/proceedings/fast02/schmuck.html
  • TeraSort: http://www.almaden.ibm.com/cs/gpfs-spsort.html
  • Tiger Shark: http://www.research.ibm.com/journal/rd/422/haskin.html