Title: Parallel I/O problems that you are likely to encounter in CSAR
1 Parallel I/O problems that you are likely to encounter in CSAR
- drl.cs.uiuc.edu/pubs/pio.html
- Marianne Winslett
- Department of Computer Science
2 Outline
- Why parallel I/O?
- Application needs
- Traditional common I/O solutions
- Why is parallel I/O hard?
- Parallel I/O functionality
- Parallel I/O performance
- MPI-IO for portable I/O support
3 Parallel I/O needs from a simulation's point of view
- Why work on processors when I/O is where the action is?
- David Patterson
- Large amounts of data to write (GB)
- Not much reading unless you are out of core
- I/O often 1/2 of run time (you'd like 1 MB/MIP but you aren't getting it)
- Certain I/O operations, I/O access patterns, and data types are typical
4 I/O quantities
- Large data sets
- > 200 MB per output operation
- Frequent data dumps: checkpoint and timestep outputs
- Large number of files
- Larger amounts as the application size and
machine size increase
5 I/O performance
- Need fast data dump techniques
- Tens or hundreds of snapshots of data per run
- Need support for fast restart
- Resume from scheduled/unscheduled system breakdowns
- Need support for efficient I/O during data analysis and visualization
- Higher bandwidth and shorter response time as the application size and machine size increase
6 Common I/O operations
- Compulsory (initialization, mostly reads)
- Checkpoint/restart
- Timestep output data
- Writing/reading temporary data generated in the computational phases
- Out-of-core reads/writes
- Data migration
- Post-processing for data analysis and
visualization
7 Collective I/O
- Processors are relatively closely synchronized
- Must exchange data with neighbors before resuming computation
- All ready to output their parts of the data set at roughly the same time
- Potential to cooperate with one another during I/O, to reduce total I/O time (see the sketch below)
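As a rough illustration of that cooperation, here is a minimal sketch of the simplest form of collective output: every processor funnels its piece to one designated writer, which issues a single large sequential write instead of many small scattered ones. The chunk size, the file name, and the choice of rank 0 as the writer are illustrative assumptions, not details from the talk.

```c
/* Minimal sketch of collective output: every processor funnels its
 * piece to rank 0, which performs one large sequential write instead
 * of many small scattered ones.  CHUNK, the file name, and the choice
 * of rank 0 as the writer are illustrative assumptions. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define CHUNK (1 << 20)   /* doubles per process (illustrative) */

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double *mine = malloc(CHUNK * sizeof(double));
    /* ... fill 'mine' with this processor's part of the data set ... */

    double *all = NULL;
    if (rank == 0)   /* assumes the whole data set fits in the writer's memory */
        all = malloc((size_t)nprocs * CHUNK * sizeof(double));

    /* Phase 1: exchange - collect every piece on the designated writer. */
    MPI_Gather(mine, CHUNK, MPI_DOUBLE, all, CHUNK, MPI_DOUBLE,
               0, MPI_COMM_WORLD);

    /* Phase 2: one large contiguous write instead of nprocs small ones. */
    if (rank == 0) {
        FILE *f = fopen("snapshot.dat", "wb");
        fwrite(all, sizeof(double), (size_t)nprocs * CHUNK, f);
        fclose(f);
        free(all);
    }

    free(mine);
    MPI_Finalize();
    return 0;
}
```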
8 Common parallel I/O patterns
- Typical I/O access patterns revealed in a single run
- Reading a small amount of initial data for application computations
- Mostly collective, sometimes non-collective
- Generating a large amount of intermediate computational results for post-processing
- Mostly collective, may involve reorganization
- Reading intermediate results
- Mostly collective, may involve reorganization
- Data analysis and visualization
- Collective/non-collective
9 Common data types
- 2D/3D dense/sparse multidimensional arrays
- Each array element of fixed length
- Fine/coarse grained data distributions
- HPF-style cyclic and block distributions spread data evenly across processors
- AMR-style distributions don't
- Affects communication overhead and load balance during I/O (see the sketch below)
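To make the distribution styles concrete, the sketch below shows which processor owns each element of a 1-D array under block and cyclic distributions; the block case maps to one contiguous slab (one large write) per processor, while the cyclic case interleaves elements and forces small, strided accesses. The array size and processor count are illustrative.

```c
/* Sketch: element ownership under HPF-style BLOCK and CYCLIC
 * distributions of a 1-D array.  N and P are illustrative. */
#include <stdio.h>

int main(void)
{
    const int N = 16, P = 4;
    const int blk = (N + P - 1) / P;      /* block size, rounded up */

    for (int i = 0; i < N; i++) {
        int block_owner  = i / blk;       /* contiguous slabs: one large write each   */
        int cyclic_owner = i % P;         /* round-robin: many small strided accesses */
        printf("element %2d: BLOCK -> P%d, CYCLIC -> P%d\n",
               i, block_owner, cyclic_owner);
    }
    return 0;
}
```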
10 I/O interfaces on parallel platforms
- Unix-like file system interfaces
- Pleasantly familiar
- Don't take the parallel environment into account
- Performance will be poor unless you have a library to help you (see the sketch below)
- System-dependent interfaces
- Offer many different I/O modes and options
- May offer high performance
- Your I/O code will not be portable
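The Unix-like pattern mentioned above typically looks like the following sketch: each process opens the shared file, seeks to its own offset, and writes its slab independently. It is functionally correct, but the file system sees many uncoordinated seeks and partial-block writes, which is why performance is poor without a library to reorganize the requests. The file name and slab size are illustrative.

```c
/* Sketch of the 'pleasantly familiar' Unix-style approach: each MPI
 * process seeks to its own offset in a shared file and writes its slab
 * independently.  Functionally fine, but the file system sees many
 * uncoordinated seeks and partial-block writes.  The file name and
 * slab size are illustrative. */
#include <mpi.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>
#include <stdlib.h>

#define SLAB (4 * 1024 * 1024)   /* bytes per process (illustrative) */

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char *slab = malloc(SLAB);
    /* ... fill 'slab' with this process's part of the output ... */

    int fd = open("output.dat", O_WRONLY | O_CREAT, 0644);
    lseek(fd, (off_t)rank * SLAB, SEEK_SET);   /* seek to my region */
    write(fd, slab, SLAB);                     /* independent, uncoordinated write */
    close(fd);

    free(slab);
    MPI_Finalize();
    return 0;
}
```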
11 I/O interface problem
12 Case studies of current I/O systems
- Examples
- IBM SP
- Intel Paragon
- Origin 2000
- Cray T3E
- Workstation clusters
13 Workstation clusters
- E.g., an FDDI-connected HP workstation cluster with a small number of nodes
- I/O support on the HP workstation cluster
- HP-UX file system
- Fast access to local disk
- No shared file system
14 Parallel I/O approaches for clusters
15 Parallel I/O approaches for clusters
(diagram: Network)
16 Multiple independent I/O nodes
- If each node sends its data to its local disk (sketched below)
- Fast
- Many files
- Non-canonical input/output format
- Need tools to help with data loading, migration,
postprocessing
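A minimal sketch of the one-file-per-node approach: each process dumps its piece to its own local file, which is fast and embarrassingly parallel but leaves the data in a non-canonical, many-file layout that later tools must reassemble. The scratch path and rank-based naming scheme are illustrative assumptions.

```c
/* Sketch: each node dumps its piece to its own local disk.  Fast and
 * embarrassingly parallel, but it leaves one file per process in a
 * non-canonical layout.  The scratch path and naming scheme are
 * illustrative. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define PIECE (1 << 20)   /* bytes per process (illustrative) */

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char *piece = malloc(PIECE);
    /* ... fill 'piece' with this node's part of the data set ... */

    char path[256];
    snprintf(path, sizeof path, "/local/scratch/dump.%04d", rank);

    FILE *f = fopen(path, "wb");   /* local disk: no contention with other nodes */
    fwrite(piece, 1, PIECE, f);
    fclose(f);

    free(piece);
    MPI_Finalize();
    return 0;
}
```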
17 Multiple independent I/O nodes
- If only some nodes act as I/O nodes
- Opens the possibility of canonical output formats (e.g., rearrange data during I/O, concatenate the resulting files)
- But if multiple compute nodes send I/O requests to the same I/O node at the same time, you will probably not get anywhere near peak disk performance, because seeks are expensive
- And the interconnect may be your bottleneck
- Conclusion: can be fast, but you need a library to help you
18 IBM SP system architecture
- High-performance switch-connected workstation cluster
- Each node is an RS/6000 workstation that can have a local disk
- I/O support on the IBM SP
- PIOFS parallel file system
- 2D file layouts, multiple access modes
- Very slow (1/2 of disk throughput), unreliable
- AIX JFS on each node's local disk
- Fast, but no shared file system
19 Parallel I/O on the SP
- If you don't use PIOFS, your options are the same as with the HP cluster
- And you probably don't want to use PIOFS
20 Origin 2000 system architecture
- A distributed shared memory and I/O system
- Each node consists of 2 CPUs, caches, memory, and a directory
- Interconnection of nodes: binary n-cube
- I/O support on 16-node Origin 2000 at NCSA
- 15 MB/sec sustained throughput for writes
- 2 SCSI-2 RAID adapters striped via the XFS file system volume manager, XLV (max 40 MB/s)
- 8 9-GB 7200-RPM disks per RAID
21 Your I/O options on the NCSA O2000
- You don't really have any. I/O will be a bottleneck for you if you save a lot of data
- Shared file systems
- Make it simple to create output in canonical format
- But will be very slow if each processor seeks to the right spot and does a small write
- Usually show little speedup if you add a library that supports multiple I/O nodes writing to the file system: no true parallelism with the shared file system. It doesn't have to be this way, but it is
- Expose you to the I/O costs of other applications
22 Intel Paragon system architecture
- Distributed memory platform
- I/O support on Intel Paragon
- Certain nodes are dedicated to I/O
- PFS parallel file system on the Caltech Paragon
- Can sustain 84 MB/sec, 512 compute nodes
- Multiple access modes (M_SYNC, M_UNIX, etc.) intended to provide high performance for different situations, but they mainly add complexity for users
- Data automatically striped across 92 I/O nodes
23 Cray T3E
- Distributed memory machine
- Separate compute and (system-controlled) I/O nodes
- I/O support on the Cray T3E at PSC
- Shared Cray Unix file system
- Separate channels for I/O and communication
- File system can support over 40 MB/sec sustained
with a single requester - The larger the write request, the better
- Additional requesters increase throughput only slightly
24 Your options on the Cray T3E
- Very hard to reach near-peak throughput
- Performance is very sensitive to the activities of other applications
- Small write requests are still a bad idea (a simple buffering sketch follows)
- Still, if you don't have an I/O library to help you, you will probably get faster I/O here than on other platforms (e.g., 20 MB/s with large write requests)
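One simple mitigation for the small-request problem, sketched below, is to stage output in a large in-memory buffer and flush it to the file system in multi-megabyte writes. The 8 MB buffer size and 512-byte record size are illustrative choices, not T3E-specific recommendations.

```c
/* Sketch: aggregate many small output records in memory and flush them
 * as large write requests, which get much closer to peak throughput.
 * The 8 MB buffer and 512-byte record are illustrative sizes; records
 * are assumed to be smaller than the buffer. */
#include <stdio.h>
#include <string.h>

#define BUFSZ (8 * 1024 * 1024)

static char   buf[BUFSZ];
static size_t used = 0;

/* Queue a small record; only touch the file when the buffer fills up. */
static void buffered_write(FILE *f, const void *rec, size_t len)
{
    if (used + len > BUFSZ) {      /* flush as one large request */
        fwrite(buf, 1, used, f);
        used = 0;
    }
    memcpy(buf + used, rec, len);
    used += len;
}

int main(void)
{
    FILE *f = fopen("timestep.dat", "wb");
    double record[64] = {0};           /* a small per-cell record (512 bytes) */

    for (int i = 0; i < 100000; i++)   /* many small logical writes */
        buffered_write(f, record, sizeof record);

    fwrite(buf, 1, used, f);           /* flush whatever is left */
    fclose(f);
    return 0;
}
```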
25 Summary of what you will find
- Simulations are often I/O intensive
- Poor parallel I/O system support out there
- Need parallel I/O solutions!
26 Understanding the parallel I/O problems
- Poor I/O performance
- Non-portable I/O code
- Complex parallel I/O systems
- Complex and changing I/O access patterns
27 Causes of the parallel I/O problems
- Poor I/O performance
- Unsuitable I/O interfaces
- Cannot capture application I/O semantics, e.g., "I want to write this distributed array"
- Conceptually simple I/O operations are transformed into inefficient and complex low-level I/O requests, with many seeks, buffering errors, and partial writes of disk blocks
- Full I/O parallelism lacking
28 Full I/O parallelism
- The I/O approach must scale up as the number of processors increases
- Shared file systems can become centralized bottlenecks
- Each I/O node should be writing at top speed: special support for collective I/O
- Requires careful load balancing
- Communication costs must also be balanced
29 Causes of the parallel I/O problems
- Non-portable I/O codes
- System dependent interfaces
- The MPI-IO Standard
- Portable interfaces
- High-performance implementations on different
platforms
30 MPI-IO
- I/O interface of the MPI-2 standard
- Goals
- Application portability
- I/O performance
- File interoperability
- Support common I/O access patterns (a minimal sketch follows)
- Support different storage device hierarchies
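A minimal sketch of how "write this distributed array" is expressed through MPI-IO: each process describes its local block of a global 2-D array with a subarray datatype, installs that as its file view, and all processes write collectively, so the array lands in canonical order in a single file. The 2x2 process grid, the 512x512 array, and the file name are illustrative assumptions.

```c
/* Sketch: writing a block-distributed 2-D array to a single file in
 * canonical order with MPI-IO.  The 2x2 process grid (run with exactly
 * 4 processes), the 512x512 global array, and the file name are
 * illustrative assumptions. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int gsizes[2] = {512, 512};                    /* global array shape */
    int psizes[2] = {2, 2};                        /* 2x2 process grid   */
    int lsizes[2] = {gsizes[0] / psizes[0],
                     gsizes[1] / psizes[1]};       /* local block shape  */
    int coords[2] = {rank / psizes[1], rank % psizes[1]};
    int starts[2] = {coords[0] * lsizes[0], coords[1] * lsizes[1]};

    /* Describe where this process's block sits inside the global array. */
    MPI_Datatype filetype;
    MPI_Type_create_subarray(2, gsizes, lsizes, starts,
                             MPI_ORDER_C, MPI_DOUBLE, &filetype);
    MPI_Type_commit(&filetype);

    double *local = malloc((size_t)lsizes[0] * lsizes[1] * sizeof(double));
    /* ... fill 'local' with this process's block of the array ... */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "array.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_set_view(fh, 0, MPI_DOUBLE, filetype, "native", MPI_INFO_NULL);

    /* Collective write: the library can reorganize data among processes
     * and issue a few large, well-formed requests to the disks. */
    MPI_File_write_all(fh, local, lsizes[0] * lsizes[1],
                       MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Type_free(&filetype);
    free(local);
    MPI_Finalize();
    return 0;
}
```

The collective write call is what gives an MPI-IO implementation room to apply strategies such as two-phase I/O behind a familiar-looking interface.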
31 Parallel I/O interface
(diagram: multiple applications on top of a portable parallel I/O interface, MPI-IO)
32 MPI-IO implementations
- ROMIO from ANL
- www.mcs.anl.gov/mpi/mpi2
- MPI-IO from LLNL
- www.llnl.gov/people/trj/goddard
- MPI-IO from IBM
- www.research.ibm.com/p/prost/sections/mpiio.html
- MPI-IO from NASA Ames
- parallel.nas.nasa.gov/MPI-IO/pmpio/pmpio.html
33 Causes of the parallel I/O problems
- Complex parallel I/O resources
- Many interdependent system modules
- Processors, memory, disks, tapes, interconnects, ...
- Many options to consider simultaneously
- File layouts, access modes
- Many performance factors
- Disk/file system utilization
- Communication system utilization
- Load-balancing
- Parallelism
34 Causes of the parallel I/O problems
- Complex and changing I/O access patterns
- Mixture of reads/writes
- Mixture of fine-grained/coarse-grained data distributions
- Trouble with balancing
- checkpoint/restart operations
- timestep output/data analysis and visualization operations
35 Our observations
- Parallel I/O systems need to provide
- Ease-of-use
- Simple I/O interfaces
- Automatic parallelism for I/O and data migration
- Application portability
- High-performance I/O strategies for a wide range of system conditions: automatic I/O strategy selection without human intervention
36 Parallel I/O strategies
(diagram: multiple applications on top of an automatic I/O strategy selection layer)
37 References
- Parallel I/O archive
- http://www.cs.dartmouth.edu/pario/bib
- Panda
- http://www.drl.cs.uiuc.edu/panda
38 The state-of-the-art parallel I/O system
(diagram: a rocket simulation uses a parallel I/O interface and parallel I/O clients, which talk over the network to parallel I/O servers handling timestep and checkpoint data on secondary and tertiary storage)