Comparison of Communication and IO of the Cray T3E and IBM SP PowerPoint PPT Presentation

Title: Comparison of Communication and IO of the Cray T3E and IBM SP


1
Comparison of Communication and I/O of the Cray
T3E and IBM SP
  • Jonathan Carter
  • NERSC User Services

2
Overview
  • Node Characteristics
  • Interconnect Characteristics
  • MPI Performance
  • I/O Configuration
  • I/O Performance

3
T3E Architecture
  • Distributed memory, single CPU processing elements

4
T3E Communication Network
  • Processing Elements (PE) are connected by a 3D
    torus.
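
As an illustration of what a 3D torus connection means, the following minimal sketch lists the six nearest neighbours of a PE, with wraparound at the edges. The function name and the 8 x 8 x 4 dimensions are hypothetical; the slides do not give the actual torus dimensions.

```python
def torus_neighbors(coord, dims):
    """Return the six nearest neighbours of a PE in a 3D torus.

    coord: (x, y, z) position of the PE.
    dims:  (nx, ny, nz) size of the torus. These are hypothetical
           values; the real dimensions depend on the machine.
    """
    neighbors = []
    for axis in range(3):
        for step in (-1, 1):
            n = list(coord)
            # The modulo is the wraparound link that turns a mesh into a torus.
            n[axis] = (n[axis] + step) % dims[axis]
            neighbors.append(tuple(n))
    return neighbors


if __name__ == "__main__":
    # A PE at the corner of the (assumed) 8 x 8 x 4 torus still has six links.
    for neighbor in torus_neighbors((0, 0, 0), (8, 8, 4)):
        print(neighbor)
```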

5
T3E Communication Network
  • The peak bandwidth of the torus is about 600
    Mbyte/sec bidirectional
  • Sustainable bandwidth is about 480 Mbytes/sec
    bidirectional
  • Latency is ? 1µs
  • Shmem API gives latency of 1µs, bandwidth 350
    Mbyte/sec bidirectional

6
SP Architecture
  • Cluster of SMP nodes

[Figure: SP node diagram; labels: Interconnect, Memory, CPU, CPU]
7
SP Communication Network
  • Nodes are connected via adapters to the SP
    Switch. Switch is composed of boards which link
    16 nodes. Boards are linked to form larger
    network.

[Figure: SP switch diagram; label: Switch Board]
8
SP Communication Network
  • The peak bandwidth of adapter and switch is 300
    Mbyte/sec bidirectional
  • Latency of the switch is about 2µs
  • Sustainable bandwidth is about 185 Mbytes/sec
    bidirectional
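
A rough way to read latency and bandwidth figures like these is the simple model time = latency + size / bandwidth. The sketch below is only that model, not a measurement. It assumes 1 Mbyte/sec means 10**6 bytes/sec, and it treats the quoted bidirectional figures as the rate seen by a single transfer, which is a simplification. The T3E values used are the Shmem figures from slide 5.

```python
def transfer_time_us(nbytes, latency_us, bandwidth_mbyte_s):
    """Estimate the time in microseconds to move nbytes.

    Simple linear model: latency + size / bandwidth.
    1 Mbyte/sec is taken as 10**6 bytes/sec, i.e. 1 byte per microsecond,
    so dividing bytes by Mbyte/sec gives microseconds directly.
    """
    return latency_us + nbytes / bandwidth_mbyte_s


if __name__ == "__main__":
    # (latency in microseconds, bandwidth in Mbyte/sec) as quoted on the slides.
    machines = {
        "T3E (Shmem)": (1.0, 350.0),
        "SP (switch)": (2.0, 185.0),
    }
    for nbytes in (8, 1024, 1024 * 1024):
        for name, (latency, bandwidth) in machines.items():
            t = transfer_time_us(nbytes, latency, bandwidth)
            print(f"{name:12s} {nbytes:>8d} bytes: {t:10.2f} us (model estimate)")
```

For small messages the latency term dominates; for large messages the bandwidth term does.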

9
MPI Performance
Intra-node is 1 MPI process per node; 2 MPI
processes (typical) will halve bandwidth
10
MPI Performance
11
MPI Performance
12
T3E I/O Configuration
  • PEs do not have local disk
  • All PEs access all filesystems equivalently
  • Path for (optimum) I/O generally looks like
      • PE to I/O node via torus
      • I/O node to Fibre Channel Node (FCN) via Gigaring
      • FCN to Disk Array via Fibre loop
  • In some cases data on APP PE must be transferred
    to a system buffer on an OS PE then out to an FCN

13
T3E I/O Configuration
14
SP I/O Configuration
  • Nodes have local disk. One SCSI disk for all
    local filesystems. Non-optimal.
  • All nodes access Global Parallel File System
    (GPFS) filesystems equivalently
  • Path for GPFS I/O looks like
      • Node to GPFS Node via IP over the switch
      • GPFS Node to Disk Array via SSA loop

15
SP I/O Configuration
[Figure: SP I/O configuration; labels: Disk Array, GPFS Nodes, Nodes, Switch]
16
T3E Filesystems
  • /usr/tmp
      • fast
      • subject to 14 day purge, not backed up
      • check quota with quota -s /usr/tmp (usually 75 Gbyte
        and 6000 inodes)
  • TMPDIR
      • fast
      • purged at end of job or session
      • shares quota with /usr/tmp
  • HOME
      • slower
      • permanent, backed up
      • check quota with quota (usually 2 Gbyte and 3500
        inodes)

17
SP Filesystems
  • /scratch and SCRATCH
      • global
      • fast (GPFS)
      • subject to 14 day purge (or at session end for
        SCRATCH), not backed up
      • check quota with myquota (usually 100 Gbyte and 6000
        inodes)
  • TMPDIR
      • local (created in /scr) - only 2 Gbyte total
      • slower
      • purged at end of job or session
  • HOME
      • global
      • slower (GPFS)
      • permanent, not backed up yet
      • check quota with myquota (usually 4 Gbyte and 5000
        inodes)

18
Types of I/O
  • Bewildering number of choices on both machines
  • Standard Language I/O Fortran or C (ANSI or
    POSIX)
  • Vendor extensions to language I/O
  • MPI I/O
  • Cray FFIO library (can be used from Fortran or C)
  • IBM MIO library, requires code changes

19
Standard Language I/O
  • Fortran direct access is slightly more efficient
    than sequential access on both the T3E (see
    comments on FFIO later) and the SP. It also
    allows file transferability.
  • C language I/O (fopen, fwrite, etc.) is
    inefficient on both machines.
  • POSIX standard I/O (open, read, etc.) can be
    efficient on the T3E, but requires care (see
    comments on FFIO later). Works well on the SP.

20
Vendor Extensions to Language I/O
  • Cray has a number of I/O routines (aqopen, etc.)
    which are legacies from the PVP systems.
    Non-portable.
  • IBM has extended Fortran syntax to provide
    asynchronous I/O. Non-portable.

21
MPI I/O
  • Part of MPI-2
  • Interface for High Performance Parallel I/O
      • data partitioning
      • collective I/O
      • asynchronous I/O
      • portability and interoperability between T3E and
        SP
  • Different subset implemented on T3E and SP

22
Summary of access routines for T3E
23
Summary of access routines for SP
24
Cray FFIO library
  • FFIO is a set of I/O layers tuned for different
    I/O characteristics
  • Buffering of data (configurable size)
  • Caching of data (configurable size)
  • Available to regular Fortran I/O without
    reprogramming
  • Available for C through POSIX-like calls, e.g.
    ffopen, ffwrite

25
FFIO - The assign command
  • controls program behavior at runtime
  • the assign command controls
      • which FFIO layer is active
      • striping across multiple partitions
      • lots more
  • scope of assign
      • File name
      • Fortran unit number
      • File type (e.g. all sequential unformatted files)

26
IBM MIO library
  • User interface based on POSIX I/O routines, so
    requires program modification
  • Useful trace module to collect statistics
  • Not much experience yet with using it on GPFS
    filesystems
  • Coming soon

27
I/O Strategies - Exclusive access files
  • Each process reads and writes to a separate file
  • Language I/O
      • On the T3E, increase language I/O performance with
        the FFIO library (for example, specify a large
        buffer with the bufa layer). For Fortran direct
        access, the default buffer is only the larger of
        the record length and 32 Kbytes
      • On the SP, read/write large amounts of data per
        request
  • MPI I/O
      • read/write large amounts of data per request
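
The advice to move large amounts of data per request can be illustrated with a small sketch that writes the same data with small and with large requests and counts the calls made to the operating system. This is a generic POSIX-style illustration in Python, not T3E or SP specific code; the function name, file name and sizes are hypothetical, and timings will vary by system.

```python
import os
import tempfile
import time


def write_in_requests(path, data, request_size):
    """Write data to path in pieces of request_size bytes.

    Uses os.write directly (no user-space buffering), so each call is
    one request to the operating system. Returns the number of calls.
    """
    requests = 0
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        for start in range(0, len(view), request_size):
            chunk = view[start:start + request_size]
            while chunk:
                # os.write may write fewer bytes than asked; retry the rest.
                written = os.write(fd, chunk)
                chunk = chunk[written:]
                requests += 1
    finally:
        os.close(fd)
    return requests


if __name__ == "__main__":
    data = bytes(4 * 1024 * 1024)  # 4 Mbyte of zeros, an arbitrary test size
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "exclusive.dat")
        for request_size in (512, 1024 * 1024):
            start = time.perf_counter()
            calls = write_in_requests(path, data, request_size)
            elapsed = time.perf_counter() - start
            print(f"{request_size:>8d} byte requests: {calls} calls, {elapsed:.4f} s")
```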

28
bufa FFIO layer Overview
  • bufa is an asynchronous buffering layer
  • performs read-ahead, write-behind
  • specify buffer size with -F bufa:bs:nbufs where
    bs is the buffer size in units of 4Kbyte blocks,
    and nbufs is the number of buffers
  • buffer space increases your application's memory
    requirements
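
Since bs is given in 4 Kbyte blocks and there are nbufs buffers, the extra memory is simply bs x 4 Kbyte x nbufs. The sketch below does that arithmetic; the function name and the example values of bs and nbufs are hypothetical, chosen only for illustration.

```python
BLOCK_BYTES = 4 * 1024  # bs is specified in units of 4 Kbyte blocks


def buffer_memory_bytes(bs, nbufs):
    """Memory used by nbufs buffers of bs blocks each, in bytes."""
    return bs * BLOCK_BYTES * nbufs


if __name__ == "__main__":
    # Hypothetical settings: 256 blocks (1 Mbyte) per buffer, 4 buffers.
    total = buffer_memory_bytes(256, 4)
    print(f"bs=256, nbufs=4 needs {total} bytes ({total // (1024 * 1024)} Mbyte)")
```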

29
I/O Strategies - Shared files
  • All PEs read and write the same file
    simultaneously
  • Language I/O (requires FFIO library global layer
    for T3E)
  • MPI I/O
  • On T3E, language I/O with FFIO library global
    layer and Cray extensions for additional
    flexibility

30
Positioning with a shared file
  • Positioning of a read or write is your
    responsibility
  • File pointers are private
  • Fortran
      • Use a direct access file, and read/write(recnum)
      • Use Cray T3E extensions setpos and getpos to
        position file pointer (not portable)
  • C
      • Use ffseek
  • MPI I/O
      • MPI I/O fileview generally takes care of this.
        Positioning routines also available.
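
The idea of each process positioning its own reads and writes can be sketched with fixed-length records, as in a direct access file. The example below is a generic POSIX illustration in Python (it does not use FFIO or MPI I/O), and it assumes records are stored back to back with no record markers. The helper names are hypothetical, and a loop over rank numbers stands in for separate processes.

```python
import os
import tempfile


def record_offset(recnum, reclen):
    """Byte offset of a 1-based record number in a fixed-length-record file."""
    return (recnum - 1) * reclen


def write_record(fd, recnum, payload, reclen):
    """Write one record at its own position without using a shared file pointer."""
    if len(payload) != reclen:
        raise ValueError("payload must be exactly one record long")
    view = memoryview(payload)
    offset = record_offset(recnum, reclen)
    while view:
        # os.pwrite takes an explicit offset; retry if the write was partial.
        written = os.pwrite(fd, view, offset)
        view = view[written:]
        offset += written


def read_record(fd, recnum, reclen):
    """Read one record from its own position."""
    return os.pread(fd, reclen, record_offset(recnum, reclen))


if __name__ == "__main__":
    reclen = 8
    fd, path = tempfile.mkstemp()
    try:
        # Each simulated rank writes record rank+1, in arbitrary order.
        for rank in (2, 0, 3, 1):
            write_record(fd, rank + 1, bytes([rank]) * reclen, reclen)
        for recnum in range(1, 5):
            print(recnum, read_record(fd, recnum, reclen))
    finally:
        os.close(fd)
        os.remove(path)
```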

31
global FFIO layer Overview
  • global is a caching and buffering layer which
    enables multiple PEs to read and write to the
    same file
  • if one PE has already read the data, an
    additional read request from another PE will
    result in a remote memory copy
  • file open is a synchronizing event
  • By default, all PEs must open a global file; this
    can be changed by calling GLIO_GROUP_MPI(comm)
  • specify buffer size with -F global:bs:nbufs where
    bs is the buffer size in units of 4Kbyte blocks,
    and nbufs is the number of buffers per PE

32
GPFS and shared files
  • On the T3E the global FFIO layer takes care of
    updates to a file from multiple PEs by tracking
    the state of the file across all PEs.
  • On the SP, GPFS implements a safe update scheme
    via tokens and a token manager.
  • If two processes access the same block of a GPFS
    file (256 Kbytes), a negotiation is conducted
    between the nodes and the token manager to
    determine the order of updates. This can slow
    down I/O considerably.
  • MPI I/O merges requests from different processes
    to alleviate this problem
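
To see when two processes end up in the same 256 Kbyte block, the following sketch works out which blocks each process's byte range touches and reports the blocks touched by more than one process. It is only arithmetic on offsets, not a model of the GPFS token protocol; the function names and the partition sizes are hypothetical.

```python
GPFS_BLOCK = 256 * 1024  # block size quoted on the slide, in bytes


def blocks_touched(offset, length, block=GPFS_BLOCK):
    """Block indices covered by the byte range [offset, offset + length)."""
    if length <= 0:
        return range(0)
    first = offset // block
    last = (offset + length - 1) // block
    return range(first, last + 1)


def shared_blocks(partitions, block=GPFS_BLOCK):
    """Blocks touched by more than one process.

    partitions: list of (offset, length), one entry per process.
    """
    owners = {}
    for rank, (offset, length) in enumerate(partitions):
        for b in blocks_touched(offset, length, block):
            owners.setdefault(b, set()).add(rank)
    return sorted(b for b, ranks in owners.items() if len(ranks) > 1)


def contiguous_partitions(nprocs, chunk):
    """Each process gets one contiguous chunk of the shared file."""
    return [(rank * chunk, chunk) for rank in range(nprocs)]


if __name__ == "__main__":
    # Chunks that ignore block boundaries share blocks between processes.
    print(shared_blocks(contiguous_partitions(4, 100_000)))
    # Chunks that are a multiple of the block size do not.
    print(shared_blocks(contiguous_partitions(4, GPFS_BLOCK)))
```

In this simple picture, giving each process a chunk that starts on a block boundary and is a multiple of the block size avoids shared blocks altogether.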

33
I/O Performance Comparison
  • Each process writes a 200 Mbyte file. 2 processes
    per node on SP.
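
A test of this kind can be approximated by timing how long one process takes to write a file of a given size. The sketch below is a hypothetical stand-in, not the benchmark used for the slides; the function name, request size and the 8 Mbyte demo size are assumptions, and Mbyte is taken as 10**6 bytes.

```python
import os
import tempfile
import time


def timed_write(path, total_bytes, request_bytes):
    """Write total_bytes to path in request_bytes pieces; return Mbyte/sec."""
    buf = bytes(request_bytes)
    start = time.perf_counter()
    with open(path, "wb") as f:
        remaining = total_bytes
        while remaining > 0:
            n = min(remaining, request_bytes)
            f.write(buf[:n])
            remaining -= n
        f.flush()
        os.fsync(f.fileno())  # include the time to push the data to disk
    elapsed = time.perf_counter() - start
    return total_bytes / 1e6 / elapsed


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "bench.dat")
        # Small demo size; the slide's test used 200 Mbyte per process.
        rate = timed_write(path, 8_000_000, 1024 * 1024)
        print(f"wrote 8 Mbyte at {rate:.1f} Mbyte/sec")
```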

34
Further Information
  • I/O on the T3E Tutorial by Richard Gerber at
    http://home.nersc.gov/training/tutorials
  • Cray Publication - Application Programmer's I/O
    Guide
  • Cray Publication - Cray T3E Fortran Optimization
    Guide
  • man assign
  • XL Fortran User's Guide