1
Architectural and Design Issues in the General
Parallel File System
May 2005
Benny Mandler - mandler@il.ibm.com
2
Agenda
  • What is GPFS?
  • A file system for high performance computing
  • General architecture
  • How does GPFS meet its challenges - architectural
    issues
  • Performance
  • Scalability
  • High availability
  • Concurrency control

3
What is Parallel I/O?
  • Multiple processes (possibly on multiple nodes)
    participate in the I/O
  • Application level parallelism
  • File is stored on multiple disks on a parallel
    file system
  • Additional Interfaces for I/O (can impact
    portability)

4
What is Parallel I/O? (Cont.)
  • Parallel I/O should safely support
  • Application-level I/O parallelism across multiple computational nodes
  • Physical parallelism over multiple disks and servers
  • Parallelism in file system overhead operations
  • A parallel file system must support
  • Parallel I/O
  • A consistent global name space across all nodes of the cluster, including a consistent view of the same file from all nodes
  • A programming model allowing programs to access file data distributed over multiple nodes, from multiple tasks running on multiple nodes (see the sketch below)
  • Physical distribution of data across disks and network entities, which eliminates bottlenecks both at the disk interface and at the network, providing more effective bandwidth to the I/O resources
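To make this access pattern concrete, here is a minimal sketch (plain POSIX, not GPFS-specific): each task writes its own disjoint byte range of one shared file. The file path, block size, and the TASK_RANK environment variable are illustrative assumptions; a real job would obtain its rank from the launcher (e.g., MPI).

```c
/* Minimal sketch: each task writes a disjoint byte range of one shared
 * file -- the access pattern a parallel file system must support safely.
 * Rank would normally come from a job launcher; here it is read from an
 * (assumed) TASK_RANK environment variable. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    long rank   = atol(getenv("TASK_RANK") ? getenv("TASK_RANK") : "0");
    long blocks = 4;                  /* blocks written per task        */
    size_t bsz  = 256 * 1024;         /* 256 KB, the GPFS default block */

    int fd = open("/gpfs/shared.dat", O_WRONLY | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    char *buf = malloc(bsz);
    memset(buf, 'A' + (rank % 26), bsz);

    /* Each task owns a contiguous region; ranges never overlap, so no
     * application-level coordination is needed beyond the file system. */
    for (long b = 0; b < blocks; b++) {
        off_t off = (off_t)rank * blocks * bsz + (off_t)b * bsz;
        if (pwrite(fd, buf, bsz, off) != (ssize_t)bsz) {
            perror("pwrite");
            return 1;
        }
    }
    free(buf);
    close(fd);
    return 0;
}
```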

5
GPFS vs. local and distributed file systems on
the SP
  • Native AIX File System (JFS)
  • No file sharing - application can only access
    files on its own node
  • Applications must do their own data partitioning
  • DCE Distributed File System
  • Application nodes (DCE clients) share files on
    server node
  • Switch is used as a fast LAN
  • Coarse-grained (file or segment level)
    parallelism
  • Server node is performance and capacity bottleneck
  • GPFS Parallel File System
  • GPFS file systems are striped across multiple
    disks on multiple storage nodes
  • Independent GPFS instances run on each
    application node
  • GPFS instances use storage nodes as "block
    servers" - all instances can access all disks

6
Scalable Parallel Computing
  • RS/6000 SP Scalable Parallel Computer
  • Hundreds of nodes connected by high-speed switch
  • N-Way SMP
  • > 1 TB disk per node
  • Hundreds of MB/s full duplex per switch port
  • Scalable parallel computing enables I/O-intensive
    applications
  • Deep computing - simulation, seismic analysis,
    data mining
  • Server consolidation - aggregating file and web servers on a centrally-managed machine
  • Streaming video and audio for multimedia
    presentation
  • Scalable object store for large digital
    libraries, web servers, databases, ...

7
GPFS History
  • Shark video server
  • Video streaming from single RS/6000
  • Complete system: included file system, network driver, and control server
  • Large data blocks, admission control, deadline
    scheduling
  • Bell Atlantic video-on-demand trial (1993-94)
  • Tiger Shark multimedia file system
  • Multimedia file system for RS/6000 SP
  • Data striped across multiple disks, accessible
    from all nodes
  • Hong Kong and Tokyo video trials, Austin video
    server products
  • GPFS parallel file system
  • General purpose file system for commercial and
    technical computing on RS/6000 SP, AIX and Linux
    clusters.
  • Recovery, online system management, byte-range
    locking, fast prefetch, parallel allocation,
    scalable directory, small-block random access,
    ...
  • Released as a product: 1.1 - 05/98, 1.2 - 12/98, 1.3 - 04/00, ...

8
What is GPFS? IBM's shared-disk, parallel file system for AIX and Linux clusters
  • Cluster: 512 nodes today, fast reliable communication, common admin domain
  • Shared disk: all data and metadata on disks accessible from any node through the disk I/O interface (i.e., "any to any" connectivity)
  • Parallel: data and metadata flow from all of the nodes to all of the disks in parallel
  • RAS: reliability, availability, serviceability

9
GPFS addresses SP I/O requirements: High Performance - multiple GB/s to/from a single file
  • Concurrent reads and writes, parallel data access
    - within a file and across files
  • Byte-range locking
  • Support fully parallel access both to file data
    and metadata
  • Client caching enabled by distributed locking
  • Wide striping
  • Large data blocks
  • Prefetch, write-behind
  • Access pattern optimizations
  • Distributed management functions
  • Multi-pathing

10
GPFS addresses SP I/O requirements (Cont.)
  • Scalability in many respects
  • Scales up to 512 nodes (N-Way SMP)
  • Storage nodes
  • File system nodes
  • Disks: 100s of TB
  • Adapters
  • High Availability
  • Fault-tolerance via logging, replication, RAID
    support
  • Survives node and disk failures
  • Uniform access via shared disks - single-image file system
  • High capacity: multiple TB per file system, 100s of GB per file
  • Standards compliant (X/Open 4.0 "POSIX") with
    minor exceptions and extensions

11
GPFS comes in different flavors
Storage Area Network
  • Advantages
  • Separate storage I/O service and application jobs
  • Well suited to synchronous applications
  • Can utilize extra switch bandwidth
  • Disadvantages
  • Performance gated by adapters in the servers
  • Advantages
  • Performance scales with the number of servers
  • Uses unused compute cycles if available
  • Can utilize extra switch bandwidth
  • Disadvantages
  • Cycle stealing from the compute nodes
  • Advantages
  • Simpler I/O model: a storage I/O operation does not require an associated network I/O operation
  • Can be used instead of a switch when a switch is
    not otherwise needed
  • Disadvantages
  • Cost/complexity when building large SANs

12
Agenda
  • What is GPFS?
  • A file system for high performance computing
  • General architecture
  • How does GPFS meet its challenges - architectural
    issues
  • Performance
  • Scalability
  • High availability
  • Concurrency control

13
Shared Disks - Virtual Shared Disk architecture
  • File systems consist of one or more shared disks
  • An individual disk can contain data, metadata, or both
  • Disks are assigned to failure groups
  • Data and metadata are striped to balance load and
    maximize parallelism
  • Recoverable Virtual Shared Disk for accessing
    disk storage
  • Disks are physically attached to SP nodes
  • VSD allows access to disks over the SP switch
  • VSD client looks like disk device driver on
    client node
  • VSD server executes I/O requests on storage
    node.
  • VSD supports JBOD or RAID volumes, fencing,
    multi-pathing (where physical hardware permits)
  • GPFS itself assumes only a conventional block I/O interface (a rough sketch follows)
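As a rough illustration of what "a conventional block I/O interface" means here, the sketch below defines a minimal block-device abstraction plus a toy RAM-backed implementation. The struct layout and function names are assumptions for illustration, not the actual VSD API.

```c
/* Illustrative block-device abstraction: GPFS (via VSD) needs little more
 * than "read/write block N on disk D". A VSD client would ship these calls
 * over the SP switch to the storage node owning the disk; this toy version
 * is backed by memory. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BLKSZ 512

struct block_dev {
    uint64_t nblocks;
    int  (*read_block)(struct block_dev *d, uint64_t blk, void *buf);
    int  (*write_block)(struct block_dev *d, uint64_t blk, const void *buf);
    void *priv;               /* driver- or VSD-specific state */
};

static char ramdisk[128][BLKSZ];   /* toy in-memory "disk" */

static int ram_read(struct block_dev *d, uint64_t blk, void *buf)
{
    (void)d; memcpy(buf, ramdisk[blk], BLKSZ); return 0;
}
static int ram_write(struct block_dev *d, uint64_t blk, const void *buf)
{
    (void)d; memcpy(ramdisk[blk], buf, BLKSZ); return 0;
}

int main(void)
{
    struct block_dev dev = { 128, ram_read, ram_write, NULL };
    char buf[BLKSZ] = "superblock";
    dev.write_block(&dev, 0, buf);   /* all GPFS data and metadata flow   */
    char out[BLKSZ];                 /* through calls of this shape       */
    dev.read_block(&dev, 0, out);
    printf("block 0: %s\n", out);
    return 0;
}
```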

14
GPFS Architecture Overview
  • Implications of Shared Disk Model
  • All data and metadata on globally accessible
    disks (VSD)
  • All access to permanent data through disk I/O
    interface
  • Distributed protocols, e.g., distributed locking,
    coordinate disk access from multiple nodes
  • Fine-grained locking allows parallel access by
    multiple clients
  • Logging and Shadowing restore consistency after
    node failures
  • Implications of Large Scale
  • Support up to 4096 disks of up to 1 TB each (4
    Petabytes)
  • The largest system in production is 75 TB
  • Failure detection and recovery protocols to
    handle node failures
  • Replication and/or RAID protect against disk /
    storage node failure
  • On-line dynamic reconfiguration (add, delete, or replace disks and nodes; rebalance the file system)

15
GPFS Architecture - Special Node Roles
  • Three types of nodes: file system, storage, and manager
  • File system nodes
  • Run user programs, read/write data to/from
    storage nodes
  • Implement virtual file system interface
  • Cooperate with manager nodes to perform metadata
    operations
  • Manager nodes
  • Global lock manager
  • File system configuration: recovery, adding disks, ...
  • Disk space allocation manager
  • Quota manager
  • File metadata manager - maintains file metadata
    integrity
  • Storage nodes
  • Implement block I/O interface
  • Shared access to file system and manager nodes
  • Interact with manager nodes for recovery (e.g.
    fencing)
  • Data and metadata striped across multiple disks -
    multiple storage nodes

16
GPFS Software Structure
17
Disk Data Structures: Files
  • Large block size allows efficient use of disk
    bandwidth
  • Fragments reduce space overhead for small files
  • No designated "mirror", no fixed placement
    function
  • Flexible replication (e.g., replicate only
    metadata, or only important files)
  • Dynamic reconfiguration: data can migrate block-by-block
  • Multi-level indirect blocks
  • Each disk address: a list of pointers to replicas
  • Each pointer: disk id + sector no. (illustrated below)
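The C structs below are an illustrative rendering of the addressing scheme described on this slide (disk address = list of replica pointers; pointer = disk id + sector number; multi-level indirect blocks). Field names and sizes are assumptions, not the GPFS on-disk format.

```c
/* Illustrative only -- field names and sizes are assumptions, not the
 * actual GPFS on-disk layout. It mirrors the slide: a disk address is a
 * list of replica pointers, each pointer = disk id + sector number. */
#include <stdint.h>
#include <stdio.h>

#define MAX_REPLICAS 2                  /* e.g. data plus one replica    */

struct disk_ptr {
    uint16_t disk_id;                   /* which disk in the file system */
    uint64_t sector;                    /* sector number on that disk    */
};

struct disk_addr {                      /* one logical file block        */
    struct disk_ptr replica[MAX_REPLICAS];
    uint8_t  nreplicas;
};

struct inode_sketch {
    uint64_t size;
    uint8_t  missing_update;            /* set while a replica is stale  */
    struct disk_addr direct[16];        /* small files                   */
    struct disk_addr indirect;          /* points to a block of addrs    */
    struct disk_addr double_indirect;   /* multi-level indirection       */
};

int main(void)
{
    printf("inode sketch occupies %zu bytes\n", sizeof(struct inode_sketch));
    return 0;
}
```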

18
Agenda
  • What is GPFS?
  • A file system for High Performance Computing
  • General architecture
  • How does GPFS meet its challenges - architectural
    issues
  • Performance
  • Scalability
  • High availability
  • Concurrency control

19
Large File Block Size
  • Conventional file systems store data in small
    blocks to pack data more densely
  • GPFS uses large blocks (256 KB default) to optimize disk transfer speed (see the rough calculation below)
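A back-of-the-envelope calculation shows why large blocks help: with a fixed positioning cost per I/O, effective bandwidth approaches the media rate only when the block is large. The seek time and transfer rate below are illustrative assumptions, not measured GPFS numbers.

```c
/* Back-of-the-envelope illustration (numbers are assumptions, not
 * measurements): effective bandwidth = block / (seek + block / rate). */
#include <stdio.h>

int main(void)
{
    double seek = 0.010;              /* 10 ms average positioning time */
    double rate = 10e6;               /* 10 MB/s media transfer rate    */
    double blocks[] = { 4e3, 64e3, 256e3 };

    for (int i = 0; i < 3; i++) {
        double b = blocks[i];
        double eff = b / (seek + b / rate);
        printf("block %6.0f KB -> %5.2f MB/s effective\n",
               b / 1e3, eff / 1e6);
    }
    return 0;   /* 4 KB gives ~0.4 MB/s, 256 KB ~7.2 MB/s: large blocks win */
}
```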

20
Parallelism and consistency
  • Distributed locking - acquire the appropriate lock for every operation - used for updates to user data
  • Centralized management - conflicting operations forwarded to a designated node - used for file metadata
  • Distributed locking + centralized hints - used for space allocation
  • Central coordinator - used for configuration changes

I/O slowdown effects: additional I/O activity rather than token server overload
21
Parallel File Access From Multiple Nodes
  • GPFS allows parallel applications on multiple
    nodes to access non-overlapping ranges of a
    single file with no conflict
  • Global locking serializes access to overlapping
    ranges of a file
  • Global locking based on "tokens" which convey
    access rights to an object (e.g. a file) or
    subset of an object (e.g. a byte range)
  • Tokens can be held across file system operations,
    enabling coherent data caching in clients
  • Cached data discarded or written to disk when
    token is revoked
  • Performance optimizations: required/desired ranges, metanode, data shipping, special token modes for file-size operations (a simplified sketch follows)
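The sketch below illustrates the byte-range token idea in simplified form: a request covered by a locally cached token is granted with no message, otherwise the token manager is asked (and may first revoke conflicting tokens held elsewhere). The data structures and the local stub standing in for the token manager are assumptions; real GPFS tokens also carry lock modes and required/desired ranges.

```c
/* Conceptual sketch of byte-range token logic (simplified). The "token
 * manager" here is a local stub standing in for the RPC a real client
 * would issue. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct brange { uint64_t start, end; };            /* [start, end)       */
struct token  { struct brange r; bool write; bool held; };

static bool covers(const struct token *t, struct brange w, bool write)
{
    return t->held && w.start >= t->r.start && w.end <= t->r.end &&
           (t->write || !write);                   /* write needs a write token */
}

/* Stub: in reality a message to the token manager node, which may first
 * revoke ("steal") conflicting tokens held by other nodes, forcing their
 * dirty cached data to disk. */
static bool token_manager_grant(struct token *t, struct brange w, bool write)
{
    t->r = w; t->write = write; t->held = true;
    return true;
}

static bool acquire_range(struct token *local, struct brange w, bool write)
{
    if (covers(local, w, write))
        return true;                               /* granted locally, no message */
    return token_manager_grant(local, w, write);
}

int main(void)
{
    struct token t = { { 0, 0 }, false, false };
    struct brange a = { 0, 1 << 20 };              /* first 1 MB          */
    printf("first write acquire: %s\n",
           acquire_range(&t, a, true) ? "granted" : "denied");

    struct brange b = { 4096, 8192 };              /* cached sub-range, read */
    if (covers(&t, b, false))
        puts("cached read acquire: granted locally, no message sent");
    return 0;
}
```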

22
I/O throughput scaling - nodes and disks
  • 32-node SP, 480 disks, 2 I/O servers
  • Single file - n large contiguous sections
  • Writes - update in place

23
Deep Prefetch for High Throughput
  • GPFS stripes successive blocks across successive
    disks
  • Disk I/O for sequential reads and writes is done
    in parallel
  • GPFS measures application "think time", disk throughput, and cache state to automatically determine the optimal parallelism (one possible sizing rule is sketched below)
  • Prefetch algorithms now recognize strided and reverse-sequential access
  • Accepts hints
  • Write-behind policy
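The slide does not give the actual GPFS heuristic, so the snippet below shows only one plausible way a prefetch depth could be sized from measured per-block think time and disk service time; the numbers and the formula itself are illustrative assumptions.

```c
/* One plausible way to size a prefetch pipeline from the quantities the
 * slide says GPFS measures (application think time, disk throughput);
 * this is NOT the actual GPFS algorithm. */
#include <stdio.h>

int main(void)
{
    double block      = 256e3;    /* bytes per block                     */
    double disk_time  = 0.030;    /* seconds to fetch one block          */
    double think_time = 0.005;    /* app processing time per block       */

    /* To keep the application from waiting, enough blocks must be in
     * flight to hide the disk latency behind the app's think time. */
    int depth = (int)(disk_time / think_time) + 1;
    double demand = block / think_time;           /* bytes/s consumed    */

    printf("consumption %.1f MB/s -> prefetch %d blocks ahead\n",
           demand / 1e6, depth);
    return 0;
}
```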

24
GPFS Throughput Scaling for Non-cached Files
  • Hardware: Power2 wide nodes, SSA disks
  • Experiment: sequential read/write from a large number of GPFS nodes to a varying number of storage nodes
  • Result: throughput increases nearly linearly with the number of storage nodes
  • Bottlenecks
  • Microchannel limits node throughput to 50 MB/s
  • System throughput limited by available storage
    nodes

25
Disk Data Structures: Allocation Map
  • Each segment contains bits representing blocks on all disks
  • Each segment is a separately lockable unit
  • Minimizes contention for the allocation map when writing files from multiple nodes
  • The allocation manager service provides hints about which segments to try (a sketch follows)
  • The inode allocation map looks similar
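A minimal sketch of a segmented allocation map, assuming an in-memory layout with pthread mutexes standing in for distributed locks: each segment covers blocks on every disk and is locked as a unit, and allocation starts at a hinted segment so nodes writing different files rarely collide. This is not the GPFS on-disk format.

```c
/* Sketch of a segmented block allocation map (layout illustrative). Each
 * segment covers some blocks on *every* disk and is locked as a unit, so
 * nodes writing different files allocate from different segments without
 * contending for the map. */
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NDISKS     8
#define NSEGMENTS  128        /* 64 blocks per disk per segment (one word) */

struct alloc_segment {
    pthread_mutex_t lock;     /* stands in for a distributed lock/token    */
    uint64_t bits[NDISKS];    /* one bit per block on each disk            */
};

static struct alloc_segment map[NSEGMENTS];

/* Allocate one block, starting from the segment the allocation manager
 * hinted at, striping across disks as free bits are found. */
static bool alloc_block(int hint, int *seg, int *disk, int *blk)
{
    for (int s = 0; s < NSEGMENTS; s++) {
        int sg = (hint + s) % NSEGMENTS;
        pthread_mutex_lock(&map[sg].lock);
        for (int d = 0; d < NDISKS; d++)
            for (int b = 0; b < 64; b++)
                if (!(map[sg].bits[d] & (1ULL << b))) {
                    map[sg].bits[d] |= 1ULL << b;
                    pthread_mutex_unlock(&map[sg].lock);
                    *seg = sg; *disk = d; *blk = b;
                    return true;
                }
        pthread_mutex_unlock(&map[sg].lock);
    }
    return false;             /* no free blocks anywhere */
}

int main(void)
{
    for (int s = 0; s < NSEGMENTS; s++)
        pthread_mutex_init(&map[s].lock, NULL);

    int sg, d, b;
    if (alloc_block(42, &sg, &d, &b))     /* 42 = hypothetical hint */
        printf("allocated block %d on disk %d (segment %d)\n", b, d, sg);
    return 0;
}
```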

26
Allocation Manager
[Diagram: the allocation manager server tracks per-segment free-space counts; client nodes send "update free" messages for the segments they allocate from or free into]
  • Deleted files' blocks are function-shipped to the current segment owners
27
Allocation manager and metanode evaluation
  • Write-in-place vs. new file creation
  • Create throughput scales nearly linearly with the number of nodes
  • Creating a single file from multiple nodes is as fast as each node creating a different file

28
HSM - Space Management
[Diagram: GPFS migrates data to, and recalls data from, an ADSM server and its database]
  • Migrates inactive data
  • Transparent recall
  • Cost / disk-full reduction
  • Policy managed
  • Integrated with backup
29
Data Management API for GPFS
  • Fully XDSM standard compliant
  • Innovative enhancements entailed by multi-node
    model
  • Intended for HSM applications such as HPSS,
    ADSM, etc.
  • Principles of operation
  • Backend: GPFS file operations generate events that are monitored by a data management application
  • Front-end: the data management application initiates invisible migration of file data between GPFS and HSM
  • High throughput using multiple sessions and
    parallel movers
  • Resilient to failures, and provides transparent
    recovery

30
High Availability - Logging and Recovery
  • Problem: detect and fix file system inconsistencies after a failure of one or more nodes
  • All updates that may leave inconsistencies if uncompleted are logged
  • Write-ahead logging policy: the log record is forced to disk before the dirty metadata is written (a minimal sketch follows this list)
  • Redo log: replaying all log records at recovery time restores file system consistency
  • Logged updates
  • I/O to replicated data
  • Directory operations (create, delete, move, ...)
  • Allocation map changes
  • Other techniques
  • Ordered writes
  • Shadowing
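A minimal write-ahead-logging sketch, assuming simple file-backed storage and an illustrative record format: the redo record is forced to the log before the dirty metadata block is written in place, so replaying the log after a crash restores consistency.

```c
/* Minimal write-ahead-logging sketch (record format and file names are
 * assumptions, not GPFS's). The invariant shown is the one on the slide:
 * the log record reaches disk before the dirty metadata it describes. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

struct log_rec {
    long txid;
    char op[16];                       /* e.g. "dir_insert"            */
    long block;                        /* metadata block being updated */
};

static int log_fd, meta_fd;

static int log_update(const struct log_rec *r,
                      const void *meta, size_t n, off_t off)
{
    /* 1. force the redo record to the log ...                          */
    if (write(log_fd, r, sizeof *r) != (ssize_t)sizeof *r) return -1;
    if (fsync(log_fd) != 0) return -1;
    /* 2. ... only then write the dirty metadata block in place.        */
    if (pwrite(meta_fd, meta, n, off) != (ssize_t)n) return -1;
    return 0;
}

int main(void)
{
    log_fd  = open("journal.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    meta_fd = open("metadata.img", O_RDWR | O_CREAT, 0644);
    if (log_fd < 0 || meta_fd < 0) { perror("open"); return 1; }

    char dirblock[512];
    memset(dirblock, 0, sizeof dirblock);
    struct log_rec r = { .txid = 1, .op = "dir_insert", .block = 7 };

    if (log_update(&r, dirblock, sizeof dirblock, 7 * 512) == 0)
        puts("update logged then applied; replaying the log after a crash "
             "re-applies (redoes) it and restores consistency");
    return 0;
}
```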

31
Node Failure Recovery
  • Application node failure
  • Force-on-steal policy ensures that all changes
    visible to other nodes have been written to disk
    and will not be lost
  • All potential inconsistencies are protected by a
    token and are logged
  • File system manager runs log recovery on behalf
    of the failed node
  • After log recovery tokens held by the failed node
    are released
  • Actions taken: restore metadata being updated by the failed node to a consistent state and release resources held by the failed node
  • File system manager failure
  • New node is appointed to take over
  • New file system manager restores volatile state
    by querying other nodes
  • New file system manager may have to undo or
    finish a partially completed configuration change
    (e.g., add/delete disk)
  • Storage node failure
  • Dual-attached disk: use the alternate path (VSD)
  • Single-attached disk: treat as a disk failure

32
Handling Disk Failures
  • When a disk failure is detected
  • The node that detects the failure informs the
    file system manager
  • File system manager updates the configuration
    data to mark the failed disk as "down" (quorum
    algorithm)
  • While a disk is down (see the sketch below)
  • Read one / write all available copies
  • "Missing update" bit set in the inode of modified files
  • When/if the disk recovers
  • File system manager searches the inode file for missing-update bits
  • All data and metadata of files with missing updates are copied back to the recovering disk (one file at a time, normal locking protocol)
  • Until missing update recovery is complete, data
    on the recovering disk is treated as write-only
  • Unrecoverable disk failure
  • Failed disk is deleted from configuration or
    replaced by a new one
  • New replicas are created on the replacement or on
    other disks
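The sketch below illustrates "read one / write all available copies" with a missing-update flag, using a toy in-memory array as the "disks"; the replica layout and helper names are assumptions for illustration.

```c
/* Sketch of replicated I/O while one disk is down: read any one available
 * copy, write all *available* copies, and flag the file so the stale
 * replica is rebuilt when the disk returns. Layout is illustrative. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define NREPLICAS 2
#define BLOCK     4096

struct replica { int disk_id; uint64_t sector; };

static bool disk_up[16] = { true, false, true, true };   /* disk 1 is down */
static char disks[16][8][BLOCK];                         /* toy "storage"  */

static bool read_block(struct replica r[NREPLICAS], char *buf)
{
    for (int i = 0; i < NREPLICAS; i++)                  /* read one copy  */
        if (disk_up[r[i].disk_id]) {
            memcpy(buf, disks[r[i].disk_id][r[i].sector], BLOCK);
            return true;
        }
    return false;                                        /* all copies down */
}

static bool write_block(struct replica r[NREPLICAS], const char *buf,
                        bool *missing_update)
{
    bool wrote = false;
    for (int i = 0; i < NREPLICAS; i++) {
        if (disk_up[r[i].disk_id]) {
            memcpy(disks[r[i].disk_id][r[i].sector], buf, BLOCK);
            wrote = true;
        } else {
            *missing_update = true;      /* remember to re-replicate later */
        }
    }
    return wrote;
}

int main(void)
{
    struct replica r[NREPLICAS] = { { 1, 3 }, { 2, 5 } };  /* one copy on the down disk */
    char buf[BLOCK] = "hello";
    bool missing = false;

    write_block(r, buf, &missing);       /* writes only the available copy */
    char out[BLOCK];
    read_block(r, out);                  /* reads whichever copy is up     */
    printf("read back '%s', missing-update bit: %d\n", out, missing);
    return 0;
}
```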

33
Concurrency Control: High-level Metadata
  • Managed by central coordinators
  • Configuration manager
  • Elected through Group Services
  • Longest surviving node
  • Appoints a manager for each GPFS file system as
    it is mounted
  • File system manager
  • Handles all changes to file system configuration,
    e.g.,
  • Adding/deleting disks (including alloc map
    initialization)
  • The only node that reads and writes configuration data (superblock)
  • Initiates and coordinates data migration
    (rebalance)
  • Creates and assigns log files
  • Token manager, etc.
  • Appointed by the configuration manager
  • Token manager coordinates distributed locking
    (next slide)
  • Others: quota manager, allocation manager, ACLs, extended attributes

34
Concurrency Control: Fine-grain (Meta)data
  • Token based distributed lock manager
  • First lock request for an object requires a
    message to the token manager to obtain a token
  • Subsequent lock requests can be granted locally
  • Data can be cached as long as a token is held
  • When a conflicting lock request is issued from
    another node the token is revoked ("token steal")
  • Force-on-steal policy: modified data are written to disk when the token is revoked (sketched below)
  • Whole-file locking for less frequent operations (e.g., create, truncate, ...); finer-grain locking for read/write
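A client-side sketch of the force-on-steal behaviour, assuming an illustrative cache and token representation: when a token is revoked, dirty blocks in the stolen range are flushed to disk before the token is surrendered.

```c
/* Client-side sketch of a token revocation ("steal") callback: before the
 * token is given up, any dirty cached data it protects is forced to disk
 * so the next holder sees all changes. Cache details are illustrative. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct cached_block {
    uint64_t offset;
    bool     dirty;
    /* ... data buffer, inode reference, etc. ... */
};

struct byte_range_token {
    uint64_t start, end;
    bool     held;
};

/* Stand-in for writing one dirty block back through the disk I/O layer. */
static void flush_block(struct cached_block *b)
{
    printf("flushing dirty block at offset %llu\n",
           (unsigned long long)b->offset);
    b->dirty = false;
}

/* Called when the token manager asks this node to give up a token because
 * another node requested a conflicting range. */
static void on_token_steal(struct byte_range_token *t,
                           struct cached_block *cache, int ncached)
{
    for (int i = 0; i < ncached; i++)
        if (cache[i].dirty &&
            cache[i].offset >= t->start && cache[i].offset < t->end)
            flush_block(&cache[i]);     /* force on steal                */
    t->held = false;                    /* cached data no longer covered */
}

int main(void)
{
    struct cached_block cache[] = {
        { 0, true }, { 262144, true }, { 524288, false }
    };
    struct byte_range_token tok = { 0, 1 << 20, true };
    on_token_steal(&tok, cache, 3);     /* simulate revoking the first 1 MB */
    return 0;
}
```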

35
Parallel System Administration
  • Data redistribution
  • Disk addition/deletion/replacement
  • Replication/striping due to disk failures

36
Cache Management
  • Balance dynamically according to usage patterns
  • Avoid fragmentation - internal and external
  • Unified steal
  • Periodic re-balancing

37
Epilogue
  • Used on six of the ten most powerful supercomputers in the world, including the largest (ASCI White)
  • Installed at several hundred customer sites, on
    clusters ranging from a few nodes with less than
    a TB of disk, up to 512 nodes with 140 TB of disk
    in 2 file systems
  • IP rich - 20 filed patents
  • State of the art
  • TeraSort
  • World record of 17 minutes
  • Using a 488-node SP: 432 file system nodes and 56 storage nodes (604e, 332 MHz)
  • Total of 6 TB of disk space
  • References
  • GPFS home page: http://www.haifa.il.ibm.com/projects/storage/gpfs.html
  • FAST 2002: http://www.usenix.org/publications/library/proceedings/fast02/schmuck.html
  • TeraSort: http://www.almaden.ibm.com/cs/gpfs-spsort.html
  • Tiger Shark: http://www.research.ibm.com/journal/rd/422/haskin.html