Title: Comparison of Communication and IO of the Cray T3E and IBM SP
1Comparison of Communication and I/O of the Cray
T3E and IBM SP
- Jonathan Carter
- NERSC User Services
2Overview
- Node Characteristics
- Interconnect Characteristics
- MPI Performance
- I/O Configuration
- I/O Performance
3T3E Architecture
- Distributed memory, single CPU processing elements
4T3E Communication Network
- Processing Elements (PE) are connected by a 3D
torus.
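As a rough illustration of what a 3D torus means for message distance, the sketch below counts the minimum number of link hops between two PEs. The 8 x 8 x 8 extents and the function name `torus_hops` are hypothetical, chosen only for the example; they are not the layout of any particular T3E.

```python
def torus_hops(a, b, dims):
    """Minimum number of link hops between PEs a and b on a torus.

    a, b: (x, y, z) coordinates; dims: (nx, ny, nz) torus extents.
    """
    hops = 0
    for ai, bi, n in zip(a, b, dims):
        d = abs(ai - bi) % n
        hops += min(d, n - d)  # the wraparound link may be the shorter way
    return hops


# Hypothetical 8 x 8 x 8 torus (512 PEs), not an actual machine layout.
dims = (8, 8, 8)
print(torus_hops((0, 0, 0), (7, 0, 0), dims))  # 1, via the wraparound link
print(torus_hops((0, 0, 0), (4, 4, 4), dims))  # 12, the farthest pair here
```

Because every dimension wraps around, no PE is more than half the extent away in any direction.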
5T3E Communication Network
- The peak bandwidth of the torus is about 600
Mbyte/sec bidirectional
- Sustainable bandwidth is about 480 Mbyte/sec
bidirectional
- Latency is ? 1µs
- Shmem API gives latency of 1µs, bandwidth 350
Mbyte/sec bidirectional
6SP Architecture
[Diagram labels: Interconnect, Memory, CPU, CPU]
7SP Communication Network
- Nodes are connected via adapters to the SP
Switch. Switch is composed of boards which link
16 nodes. Boards are linked to form larger
network.
[Diagram label: Switch Board]
8SP Communication Network
- The peak bandwidth of adapter and switch is 300
Mbyte/sec bidirectional
- Latency of the switch is about 2µs
- Sustainable bandwidth is about 185 Mbyte/sec
bidirectional
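To see how the latency and bandwidth figures combine, a simple first-order model is time = latency + size / bandwidth. The sketch below applies it to the sustainable figures quoted for both machines. It rests on several assumptions: the T3E latency is taken as 1 µs, the quoted bidirectional bandwidth is treated as if it were available to a single message, and software overheads (for example in MPI) are ignored. Treat the output as an illustration of the model, not as a measurement.

```python
def transfer_time_us(nbytes, latency_us, bandwidth_mb_s):
    """Estimated message time in microseconds: latency + size / bandwidth."""
    # 1 Mbyte/sec equals 1 byte per microsecond (taking 1 Mbyte = 10**6 bytes).
    return latency_us + nbytes / bandwidth_mb_s


# Sustainable figures from the slides; T3E latency of 1 us is an assumption.
T3E = {"latency_us": 1.0, "bandwidth_mb_s": 480.0}
SP = {"latency_us": 2.0, "bandwidth_mb_s": 185.0}

for size in (8, 1024, 1024 * 1024):
    t3e = transfer_time_us(size, **T3E)
    sp = transfer_time_us(size, **SP)
    print(f"{size:>8} bytes: T3E {t3e:10.2f} us, SP {sp:10.2f} us")
```

In this model small messages are dominated by latency and large messages by bandwidth.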
9MPI Performance
Intra-node is 1 MPI process per node, 2 MPI
processes (typical) will halve bandwidth
10MPI Performance
11MPI Performance
12T3E I/O Configuration
- PEs do not have local disk
- All PEs access all filesystems equivalently
- Path for (optimum) I/O generally looks like
- PE to I/O node via torus
- I/O node to Fibre Channel Node (FCN) via Gigaring
- FCN to Disk Array via Fibre loop
- In some cases data on APP PE must be transferred
to a system buffer on an OS PE then out to an FCN
13T3E I/O Configuration
14SP I/O Configuration
- Nodes have local disk. One SCSI disk for all
local filesystems. Non-optimal.
- All nodes access Global Parallel File System
(GPFS) filesystems equivalently
- Path for GPFS I/O looks like
- Node to GPFS Node via IP over the switch
- GPFS Node to Disk Array via SSA loop
15SP I/O Configuration
[Diagram labels: Nodes, Switch, GPFS Nodes, Disk Array]
16T3E Filesystems
- /usr/tmp
- fast
- subject to 14 day purge, not backed up
- check quota with quota -s /usr/tmp (usually 75 Gbyte
and 6000 inodes)
- TMPDIR
- fast
- purged at end of job or session
- shares quota with /usr/tmp
- HOME
- slower
- permanent, backed up
- check quota with quota (usually 2 Gbyte and 3500
inodes)
17SP Filesystems
- /scratch and SCRATCH
- global
- fast (GPFS)
- subject to 14 day purge (or at session end for
SCRATCH), not backed up
- check quota with myquota (usually 100 Gbyte and 6000
inodes)
- TMPDIR
- local (created in /scr) - only 2 Gbyte total
- slower
- purged at end of job or session
- HOME
- global
- slower (GPFS)
- permanent, not backed up yet
- check quota with myquota (usually 4 Gbyte and 5000
inodes)
18Types of I/O
- Bewildering number of choices on both machines
- Standard Language I/O: Fortran or C (ANSI or
POSIX)
- Vendor extensions to language I/O
- MPI I/O
- Cray FFIO library (can be used from Fortran or C)
- IBM MIO library, requires code changes
19Standard Language I/O
- Fortran direct access is slightly more efficient
than sequential access both on the T3E (see
comments on FFIO later) and the SP. It also
allows file transferability.
- C language I/O (fopen, fwrite, etc.) is
inefficient on both machines.
- POSIX standard I/O (open, read, etc.) can be
efficient on the T3E, but requires care (see
comments on FFIO later). Works well on the SP.
20Vendor Extensions to Language I/O
- Cray has a number of I/O routines (aqopen, etc.)
which are legacies from the PVP systems.
Non-portable.
- IBM has extended Fortran syntax to provide
asynchronous I/O. Non-portable.
21MPI I/O
- Part of MPI-2
- Interface for High Performance Parallel I/O
- data partitioning
- collective I/O
- asynchronous I/O
- portability and interoperability between T3E and
SP
- Different subset implemented on T3E and SP
22Summary of access routines for T3E
23Summary of access routines for SP
24Cray FFIO library
- FFIO is a set of I/O layers tuned for different
I/O characteristics
- Buffering of data (configurable size)
- Caching of data (configurable size)
- Available to regular Fortran I/O without
reprogramming
- Available for C through POSIX-like calls, e.g.
ffopen, ffwrite
25FFIO - The assign command
- controls program behavior at runtime
- the assign command controls
- which FFIO layer is active
- striping across multiple partitions
- lots more
- scope of assign
- File name
- Fortran unit number
- File type (e.g. all sequential unformatted files)
26IBM MIO library
- User interface based on POSIX I/O routines, so
requires program modification
- Useful trace module to collect statistics
- Not much experience with using on GPFS filesystem
- Coming soon
27I/O Strategies - Exclusive access files
- Each process reads and writes to a separate file
- Language I/O
- Increase language I/O performance with FFIO
library (for example, specify a large buffer with
the bufa layer) on T3E. For Fortran direct access
the default buffer is only the maximum of the record
length or 32 Kbytes
- read/write large amounts of data per request on
the SP
- MPI I/O
- read/write large amounts of data per request
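A minimal sketch of the exclusive-access pattern with POSIX-style calls: each process writes its own file and moves a large amount of data per request. The file naming scheme (`out.NNNN`), the `rank` argument, and the request sizes are hypothetical; in a real job the rank would come from the parallel runtime. The example counts write calls for a small and a large request size to show how the request size controls the number of calls.

```python
import os
import tempfile


def write_private_file(directory, rank, data, request_size):
    """Write data to a file owned by one process, request_size bytes per call."""
    path = os.path.join(directory, f"out.{rank:04d}")  # hypothetical naming scheme
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    calls = 0
    try:
        view = memoryview(data)
        pos = 0
        while pos < len(view):
            # os.write may write fewer bytes than requested; advance by what it reports.
            pos += os.write(fd, view[pos:pos + request_size])
            calls += 1
    finally:
        os.close(fd)
    return path, calls


data = bytes(4 * 1024 * 1024)  # 4 Mbyte of zeros as stand-in data
with tempfile.TemporaryDirectory() as tmp:
    _, small = write_private_file(tmp, 0, data, 4 * 1024)
    _, large = write_private_file(tmp, 0, data, 1024 * 1024)
    print("calls with 4 Kbyte requests:", small)
    print("calls with 1 Mbyte requests:", large)
```

Fewer, larger requests mean fewer trips through the I/O path for the same amount of data.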
28bufa FFIO layer Overview
- bufa is an asynchronous buffering layer
- performs read-ahead, write-behind
- specify buffer size with -F bufa:bs:nbufs where
bs is the buffer size in units of 4Kbyte blocks,
and nbufs is the number of buffers
- buffer space increases your application's memory
requirements
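The memory cost of a buffering layer follows directly from the two parameters: bs blocks of 4 Kbytes each, times nbufs buffers. The sketch below does the arithmetic for a hypothetical setting of bs = 128 and nbufs = 4; these values are illustrative, not a recommendation.

```python
BLOCK_BYTES = 4 * 1024  # buffer sizes are given in units of 4 Kbyte blocks


def buffer_bytes(bs, nbufs):
    """Memory taken by nbufs buffers of bs blocks each."""
    return bs * BLOCK_BYTES * nbufs


# Hypothetical setting: 128-block buffers (512 Kbytes each), 4 buffers.
total = buffer_bytes(128, 4)
print(total // 1024, "Kbytes")  # 2048 Kbytes
```

This memory comes out of what is available to the application on each PE, so it is worth checking before choosing very large buffers.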
29I/O Strategies - Shared files
- All PEs read and write the same file
simultaneously
- Language I/O (requires FFIO library global layer
for T3E)
- MPI I/O
- On T3E, language I/O with FFIO library global
layer and Cray extensions for additional
flexibility
30Positioning with a shared file
- Positioning of a read or write is your
responsibility
- File pointers are private
- Fortran
- Use a direct access file, and read/write(recnum)
- Use Cray T3E extensions setpos and getpos to
position file pointer (not portable)
- C
- Use ffseek
- MPI I/O
- MPI I/O fileview generally takes care of this.
Positioning routines also available.
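The direct-access idea can be sketched with positioned POSIX writes: with fixed-length records, record number recnum starts at byte (recnum - 1) * reclen, so each process can compute where its data goes without relying on a shared file pointer. The record length, the interleaved ownership scheme, and the helper names below are hypothetical, and the PEs are simulated by a loop in a single process. The sketch also assumes a file with no record markers.

```python
import os
import tempfile

RECLEN = 64  # bytes per fixed-length record (hypothetical)


def record_offset(recnum, reclen=RECLEN):
    """Byte offset of a 1-based record number, as in Fortran direct access."""
    if recnum < 1:
        raise ValueError("record numbers start at 1")
    return (recnum - 1) * reclen


def write_record(fd, recnum, payload, reclen=RECLEN):
    """Write one record at its own position without using a file pointer."""
    if len(payload) > reclen:
        raise ValueError("payload longer than record")
    os.pwrite(fd, payload.ljust(reclen, b" "), record_offset(recnum, reclen))


# Four simulated PEs interleave records in one file: PE p owns records p+1, p+5, ...
npes, rounds = 4, 2
with tempfile.TemporaryDirectory() as tmp:
    fd = os.open(os.path.join(tmp, "shared.dat"), os.O_RDWR | os.O_CREAT, 0o600)
    try:
        for pe in reversed(range(npes)):  # order of arrival does not matter
            for r in range(rounds):
                recnum = r * npes + pe + 1
                write_record(fd, recnum, f"pe{pe} round{r}".encode())
        print(os.pread(fd, RECLEN, record_offset(6)).rstrip())  # b'pe1 round1'
    finally:
        os.close(fd)
```

Because every write carries its own offset, the result does not depend on the order in which the PEs reach the file.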
31global FFIO layer Overview
- global is a caching and buffering layer which
enables multiple PEs to read and write to the
same file
- if one PE has already read the data, an
additional read request from another PE will
result in a remote memory copy
- file open is a synchronizing event
- By default, all PEs must open a global file; this
can be changed by calling GLIO_GROUP_MPI(comm)
- specify buffer size with -F global:bs:nbufs where
bs is the buffer size in units of 4Kbyte blocks,
and nbufs is the number of buffers per PE
32GPFS and shared files
- On the T3E the global FFIO layer takes care of
updates to a file from multiple PEs by tracking
the state of the file across all PEs.
- On the SP, GPFS implements a safe update scheme
via tokens and a token manager.
- If two processes access the same block of a GPFS
file (256 Kbytes), a negotiation is conducted
between the nodes and the token manager to
determine the order of updates. This can slow
down I/O considerably.
- MPI I/O merges requests from different processes
to alleviate this problem
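One way to reason about the block-sharing problem is to work out which 256 Kbyte blocks each process's byte range covers and look for blocks touched by more than one process. The sketch below does that for a hypothetical layout of four processes; the slice size, the padded layout, and the helper names are illustrative only, and the code says nothing about how GPFS or MPI I/O work internally.

```python
GPFS_BLOCK = 256 * 1024  # block size quoted above


def blocks_touched(offset, length, block=GPFS_BLOCK):
    """Indices of the file blocks covered by bytes [offset, offset + length)."""
    if length <= 0:
        return range(0)
    return range(offset // block, (offset + length - 1) // block + 1)


def shared_blocks(requests, block=GPFS_BLOCK):
    """Blocks touched by more than one process; requests maps rank -> (offset, length)."""
    owners = {}
    for rank, (offset, length) in requests.items():
        for b in blocks_touched(offset, length, block):
            owners.setdefault(b, set()).add(rank)
    return {b: ranks for b, ranks in owners.items() if len(ranks) > 1}


# Hypothetical layout: 4 processes each write a contiguous 300,000-byte slice.
unaligned = {r: (r * 300_000, 300_000) for r in range(4)}
# Same data, but each slice padded to start on a block boundary (2 blocks each).
aligned = {r: (r * 2 * GPFS_BLOCK, 300_000) for r in range(4)}

print(sorted(shared_blocks(unaligned)))  # [1, 2, 3]
print(sorted(shared_blocks(aligned)))    # []
```

In the unaligned layout every boundary between neighbouring slices falls inside a block, so three blocks are contended; starting each slice on a block boundary removes the overlap at the cost of some padding.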
33I/O Performance Comparison
- Each process writes a 200 Mbyte file. 2 processes
per node on SP.
34Further Information
- I/O on the T3E Tutorial by Richard Gerber at
http://home.nersc.gov/training/tutorials
- Cray Publication - Application Programmer's I/O
Guide
- Cray Publication - Cray T3E Fortran Optimization
Guide
- man assign
- XL Fortran User's Guide