Comparison of Communication and IO of the Cray T3E and IBM SP presentation

1
Comparison of Communication and I/O of the Cray
T3E and IBM SP

Jonathan Carter
NERSC User Services

2
Overview

Node Characteristics
Interconnect Characteristics
MPI Performance
I/O Configuration
I/O Performance

3
T3E Architecture

Distributed memory, single CPU processing elements

4
T3E Communication Network

Processing Elements (PE) are connected by a 3D
torus.

5
T3E Communication Network

The peak bandwidth of the torus is about 600
Mbyte/sec bidirectional
Sustainable bandwidth is about 480 Mbytes/sec
bidirectional
Latency is about 1µs
Shmem API gives latency of 1µs, bandwidth 350
Mbyte/sec bidirectional

6
SP Architecture

Cluster of SMP nodes

[Figure: SP node diagram; labels: Interconnect, Memory, CPU, CPU]
7
SP Communication Network

Nodes are connected via adapters to the SP
Switch. Switch is composed of boards which link
16 nodes. Boards are linked to form larger
network.

[Figure: Switch Board]
8
SP Communication Network

The peak bandwidth of adapter and switch is 300
Mbyte/sec bidirectional
Latency of the switch is about 2µs
Sustainable bandwidth is about 185 Mbytes/sec
bidirectional
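
The latency and bandwidth figures on the two network slides can be combined in a simple linear cost model, time = latency + bytes / bandwidth. The sketch below is only a rough illustration: the model is an assumption (it ignores contention, protocol overhead, and MPI software costs), the function and dictionary names are made up for this example, and treating the quoted bidirectional bandwidth as the rate seen by one transfer is also an assumption. The numbers are the sustainable bandwidths and latencies quoted on the slides, taking 1µs as the T3E latency.

```python
# Linear cost model: time = latency + size / bandwidth (an assumed simplification).
# Figures come from the slides; using the bidirectional bandwidth as the
# rate seen by a single transfer is an assumption made for illustration.
NETWORKS = {
    "T3E": {"latency_s": 1e-6, "bandwidth_Bps": 480e6},
    "SP": {"latency_s": 2e-6, "bandwidth_Bps": 185e6},
}


def transfer_time(nbytes, latency_s, bandwidth_Bps):
    """Estimated time in seconds to move nbytes in one message."""
    return latency_s + nbytes / bandwidth_Bps


def effective_bandwidth(nbytes, latency_s, bandwidth_Bps):
    """Bytes per second achieved for a single message of nbytes."""
    return nbytes / transfer_time(nbytes, latency_s, bandwidth_Bps)


def half_bandwidth_size(latency_s, bandwidth_Bps):
    """Message size at which effective bandwidth is half the sustained rate."""
    return latency_s * bandwidth_Bps


if __name__ == "__main__":
    for name, net in NETWORKS.items():
        for size in (1_000, 100_000, 10_000_000):
            bw = effective_bandwidth(size, **net)
            print(f"{name}: {size:>10} bytes -> {bw / 1e6:7.1f} Mbyte/sec")
```

Under this model, messages of a few hundred bytes are dominated by latency on both machines, and only large messages approach the sustainable bandwidth.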

9
MPI Performance
Intra-node is 1 MPI process per node; 2 MPI
processes (typical) will halve the bandwidth
10
MPI Performance
11
MPI Performance
12
T3E I/O Configuration

PEs do not have local disk
All PEs access all filesystems equivalently
Path for (optimum) I/O generally looks like
PE to I/O node via torus
I/O node to Fibre Channel Node (FCN) via Gigaring
FCN to Disk Array via Fibre loop
In some cases data on APP PE must be transferred
to a system buffer on an OS PE then out to an FCN

13
T3E I/O Configuration
14
SP I/O Configuration

Nodes have local disk. One SCSI disk for all
local filesystems. Non-optimal.
All nodes access Global Parallel File System
(GPFS) filesystems equivalently
Path for GPFS I/O looks like
Node to GPFS Node via IP over the switch
GPFS Node to Disk Array via SSA loop

15
SP I/O Configuration
[Figure: SP I/O configuration; labels: Disk Array, GPFS Nodes, Nodes, Switch, Switch]
16
T3E Filesystems

/usr/tmp
fast
subject to 14 day purge, not backed up
check quota with quota -s /usr/tmp (usually 75 Gbyte
and 6000 inodes)
TMPDIR
fast
purged at end of job or session
shares quota with /usr/tmp
HOME
slower
permanent, backed up
check quota with quota (usually 2 Gbyte and 3500
inodes)

17
SP Filesystems

/scratch and SCRATCH
global
fast (GPFS)
subject to 14 day purge (or at session end for
SCRATCH), not backed up
check quota with myquota (usually 100 Gbyte and 6000
inodes)
TMPDIR
local (created in /scr) - only 2 Gbyte total
slower
purged at end of job or session
HOME
global
slower (GPFS)
permanent, not backed up yet
check quota with myquota (usually 4 Gbyte and 5000
inodes)

18
Types of I/O

Bewildering number of choices on both machines
Standard Language I/O Fortran or C (ANSI or
POSIX)
Vendor extensions to language I/O
MPI I/O
Cray FFIO library (can be used from Fortran or C)
IBM MIO library, requires code changes

19
Standard Language I/O

Fortran direct access is slightly more efficient
than sequential access on both the T3E (see
comments on FFIO later) and the SP. It also
allows file transferability.
C language I/O (fopen, fwrite, etc.) is
inefficient on both machines.
POSIX standard I/O (open, read, etc.) can be
efficient on the T3E, but requires care (see
comments on FFIO later). Works well on the SP.

20
Vendor Extensions to Language I/O

Cray has a number of I/O routines (aqopen, etc.)
which are legacies from the PVP systems.
Non-portable.
IBM has extended Fortran syntax to provide
asynchronous I/O. Non-portable.

21
MPI I/O

Part of MPI-2
Interface for High Performance Parallel I/O
data partitioning
collective I/O
asynchronous I/O
portability and interoperability between T3E and
SP
Different subset implemented on T3E and SP
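
The data partitioning idea can be illustrated without any MPI calls: each process owns a disjoint, contiguous part of the file, and the start and length of that part follow from the process rank. The sketch below is plain Python arithmetic, not MPI I/O. The function names and the 8-byte item size are assumptions for illustration. In an MPI I/O program, numbers like these would typically feed a file view or an explicit-offset read or write.

```python
def block_partition(n_items, n_procs, rank):
    """Return (start, count) of the contiguous block owned by rank.

    Items are split as evenly as possible; the first n_items % n_procs
    ranks each receive one extra item.
    """
    base, extra = divmod(n_items, n_procs)
    count = base + (1 if rank < extra else 0)
    start = rank * base + min(rank, extra)
    return start, count


def byte_range(n_items, n_procs, rank, item_size=8):
    """Return (byte offset, byte length) of the block owned by rank."""
    start, count = block_partition(n_items, n_procs, rank)
    return start * item_size, count * item_size


if __name__ == "__main__":
    for rank in range(4):
        print(rank, block_partition(10, 4, rank), byte_range(10, 4, rank))
```

Because the blocks are disjoint and cover the whole file, every process can read or write its own part without coordinating file positions with the others.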

22
Summary of access routines for T3E
23
Summary of access routines for SP
24
Cray FFIO library

FFIO is a set of I/O layers tuned for different
I/O characteristics
Buffering of data (configurable size)
Caching of data (configurable size)
Available to regular Fortran I/O without
reprogramming
Available for C through POSIX-like calls, e.g.
ffopen, ffwrite

25
FFIO - The assign command

controls program behavior at runtime
the assign command controls
which FFIO layer is active
striping across multiple partitions
lots more
scope of assign
File name
Fortran unit number
File type (e.g. all sequential unformatted files)

26
IBM MIO library

User interface based on POSIX I/O routines, so
requires program modification
Useful trace module to collect statistics
Not much experience with using on GPFS filesystem
Coming soon

27
I/O Strategies - Exclusive access files

Each process reads and writes to a separate file
Language I/O
Increase language I/O performance with FFIO
library (for example, specify a large buffer with
the bufa layer) on T3E. For Fortran direct access
the default buffer is only the larger of the record
length and 32 Kbytes
read/write large amounts of data per request on
the SP
MPI I/O
read/write large amounts of data per request
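
The exclusive-access strategy can be sketched as follows: each process writes its own file and moves a large amount of data per request. This is an illustration only, not code from the presentation. The `rank` argument stands in for an MPI rank or PE number, and the file naming scheme and the 4 Mbyte request size are assumptions rather than tuned values.

```python
import os
import tempfile


def write_exclusive(directory, rank, data, request_bytes=4 * 1024 * 1024):
    """Write data to a file owned by a single process, using large requests.

    rank is a hypothetical process number; request_bytes is an
    illustrative request size. Returns the file path and the number of
    write requests issued.
    """
    path = os.path.join(directory, f"out.{rank:04d}")
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    requests = 0
    try:
        view = memoryview(data)
        pos = 0
        while pos < len(view):
            # os.write may write fewer bytes than requested, so advance
            # by the count it reports.
            pos += os.write(fd, view[pos:pos + request_bytes])
            requests += 1
    finally:
        os.close(fd)
    return path, requests


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        payload = bytes(16 * 1024 * 1024)
        for rank in range(2):
            path, n = write_exclusive(d, rank, payload)
            print(path, os.path.getsize(path), "bytes in", n, "requests")
```

Fewer, larger requests reduce the per-call overhead, which is the point of the advice above.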

28
bufa FFIO layer Overview

bufa is an asynchronous buffering layer
performs read-ahead, write-behind
specify buffer size with -F bufa:bs:nbufs where
bs is the buffer size in units of 4Kbyte blocks,
and nbufs is the number of buffers
buffer space increases your application's memory
requirements
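
The memory cost of the buffers follows directly from the two parameters: bs blocks of 4 Kbytes each, times nbufs buffers. The short sketch below does that arithmetic. The function names and the example values (bs = 256, nbufs = 4, 64 PEs) are illustrative assumptions, not recommended settings.

```python
BLOCK_BYTES = 4 * 1024  # buffer size is given in units of 4 Kbyte blocks


def ffio_buffer_bytes(bs, nbufs):
    """Bytes of buffer space used by one process for one file."""
    return bs * BLOCK_BYTES * nbufs


def total_buffer_bytes(bs, nbufs, n_pes):
    """Bytes of buffer space summed over n_pes processes, one file each."""
    return ffio_buffer_bytes(bs, nbufs) * n_pes


if __name__ == "__main__":
    per_pe = ffio_buffer_bytes(256, 4)
    print(per_pe // (1024 * 1024), "Mbytes per PE")
    print(total_buffer_bytes(256, 4, 64) // (1024 * 1024), "Mbytes over 64 PEs")
```

With these example values each PE gives up 4 Mbytes of memory per open file, which adds up if many files are open at once.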

29
I/O Strategies - Shared files

All PEs read and write the same file
simultaneously
Language I/O (requires FFIO library global layer
for T3E)
MPI I/O
On T3E, language I/O with FFIO library global
layer and Cray extensions for additional
flexibility

30
Positioning with a shared file

Positioning of a read or write is your
responsibility
File pointers are private
Fortran
Use a direct access file, and read/write(recnum)
Use Cray T3E extensions setpos and getpos to
position file pointer (not portable)
C
Use ffseek
MPI I/O
MPI I/O file view generally takes care of this.
Positioning routines also available.
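
Explicit positioning in a shared file can be sketched with fixed-length records, in the spirit of Fortran direct access: the byte offset of record recnum is (recnum - 1) times the record length. The example below uses POSIX positional reads and writes through Python's `os.pread` and `os.pwrite` (Unix only), so each writer supplies its own offset and no file pointer is shared. It assumes records are stored back to back with no record markers. The helper names are hypothetical.

```python
import os
import tempfile


def record_offset(recnum, reclen):
    """Byte offset of 1-based record recnum in a file of fixed-length records."""
    if recnum < 1:
        raise ValueError("record numbers start at 1")
    return (recnum - 1) * reclen


def write_record(fd, recnum, reclen, payload):
    """Write one full record at its own offset, independent of any file pointer."""
    if len(payload) != reclen:
        raise ValueError("payload must be exactly one record long")
    os.pwrite(fd, payload, record_offset(recnum, reclen))


def read_record(fd, recnum, reclen):
    """Read one record from its own offset."""
    return os.pread(fd, reclen, record_offset(recnum, reclen))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "shared.dat")
        # Two descriptors stand in for two processes sharing one file.
        fd_a = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
        fd_b = os.open(path, os.O_RDWR)
        write_record(fd_b, 2, 8, b"BBBBBBBB")  # records may arrive in any order
        write_record(fd_a, 1, 8, b"AAAAAAAA")
        print(read_record(fd_a, 1, 8), read_record(fd_a, 2, 8))
        os.close(fd_a)
        os.close(fd_b)
```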

31
global FFIO layer Overview

global is a caching and buffering layer which
enables multiple PEs to read and write to the
same file
if one PE has already read the data, an
additional read request from another PE will
result in a remote memory copy
file open is a synchronizing event
By default, all PEs must open a global file; this
can be changed by calling GLIO_GROUP_MPI(comm)
specify buffer size with -F global:bs:nbufs where
bs is the buffer size in units of 4Kbyte blocks,
and nbufs is the number of buffers per PE

32
GPFS and shared files

On the T3E the global FFIO layer takes care of
updates to a file from multiple PEs by tracking
the state of the file across all PEs.
On the SP, GPFS implements a safe update scheme
via tokens and a token manager.
If two processes access the same block of a GPFS
file (256 Kbytes), a negotiation is conducted
between the nodes and the token manager to
determine the order of updates. This can slow
down I/O considerably.
MPI I/O merges requests from different processes
to alleviate this problem
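
Whether two processes will touch the same GPFS block can be checked with simple arithmetic on their byte ranges. The sketch below uses the 256 Kbyte block size quoted above and reports the blocks that more than one range falls into. It does not model the token manager itself, and the function names are illustrative assumptions.

```python
GPFS_BLOCK = 256 * 1024  # block size quoted on the slide


def blocks_touched(offset, length, block=GPFS_BLOCK):
    """Set of block indices covered by the byte range [offset, offset + length)."""
    if length <= 0:
        return set()
    first = offset // block
    last = (offset + length - 1) // block
    return set(range(first, last + 1))


def contended_blocks(ranges, block=GPFS_BLOCK):
    """Blocks touched by more than one (offset, length) range."""
    seen, shared = set(), set()
    for offset, length in ranges:
        touched = blocks_touched(offset, length, block)
        shared |= seen & touched
        seen |= touched
    return shared


if __name__ == "__main__":
    # Four processes writing 100000 bytes each, back to back.
    unaligned = [(rank * 100_000, 100_000) for rank in range(4)]
    # Four processes writing one whole block each, on block boundaries.
    aligned = [(rank * GPFS_BLOCK, GPFS_BLOCK) for rank in range(4)]
    print("unaligned:", sorted(contended_blocks(unaligned)))
    print("aligned:  ", sorted(contended_blocks(aligned)))
```

In the unaligned case several processes land in the same blocks, which is the situation that triggers token negotiation. Aligning each process's region to block boundaries leaves no block shared.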

33
I/O Performance Comparison

Each process writes a 200 Mbyte file. 2 processes
per node on SP.

34
Further Information

I/O on the T3E Tutorial by Richard Gerber at
http://home.nersc.gov/training/tutorials
Cray Publication - Application Programmer's I/O
Guide
Cray Publication - Cray T3E Fortran Optimization
Guide
man assign
XL Fortran User's Guide
