Title: Parallel I/O problems that you are likely to encounter in CSAR
1 Parallel I/O problems that you are likely to encounter in CSAR
- drl.cs.uiuc.edu/pubs/pio.html
- Marianne Winslett
- Department of Computer Science
2 Outline
- Why parallel I/O?
- Application needs
- Traditional common I/O solutions
- Why is parallel I/O hard?
- Parallel I/O functionality
- Parallel I/O performance
- MPI-IO for portable I/O support
3 Parallel I/O needs from a simulation's point of view
- Why work on processors when I/O is where the action is?
- David Patterson
- Large amounts of data to write (GB)
- Not much reading unless you are out of core
- I/O often 1/2 of run time (you'd like 1 MB/MIP but you aren't getting it)
- Certain I/O operations, I/O access patterns, and data types are typical
4 I/O quantities
- Large data sets
- > 200 MB per output operation
- Frequent data dumps: checkpoint and timestep outputs
- Large number of files
- Larger amounts as the application size and
machine size increase
5 I/O performance
- Need fast data dump techniques
- Tens or hundreds of snapshots of data per run
- Need support for fast restart
- Resume from scheduled/unscheduled system breakdowns
- Need support for efficient I/O during data analysis and visualization
- Higher bandwidth and shorter response time as the application size and machine size increase
6 Common I/O operations
- Compulsory (initialization, mostly reads)
- Checkpoint/restart
- Timestep output data
- Writing/reading temporary data generated in the computational phases
- Out-of-core reads/writes
- Data migration
- Post-processing for data analysis and
visualization
7 Collective I/O
- Processors are relatively closely synchronized
- Must exchange data with neighbors before resuming computation
- All ready to output their parts of the data set at roughly the same time
- Potential to cooperate with one another during I/O, to reduce total I/O time (see the sketch below)
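As a rough illustration of that cooperation, here is a minimal sketch of the simplest form of collective output: every processor funnels its piece to one designated writer, which issues a single large sequential write instead of many small scattered ones. The chunk size, the file name, and the choice of rank 0 as the writer are illustrative assumptions, not details from the talk.

```c
/* Minimal sketch of collective output: every processor funnels its
 * piece to rank 0, which performs one large sequential write instead
 * of many small scattered ones.  CHUNK, the file name, and the choice
 * of rank 0 as the writer are illustrative assumptions. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define CHUNK (1 << 20)   /* doubles per process (illustrative) */

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double *mine = malloc(CHUNK * sizeof(double));
    /* ... fill 'mine' with this processor's part of the data set ... */

    double *all = NULL;
    if (rank == 0)   /* assumes the whole data set fits in the writer's memory */
        all = malloc((size_t)nprocs * CHUNK * sizeof(double));

    /* Phase 1: exchange - collect every piece on the designated writer. */
    MPI_Gather(mine, CHUNK, MPI_DOUBLE, all, CHUNK, MPI_DOUBLE,
               0, MPI_COMM_WORLD);

    /* Phase 2: one large contiguous write instead of nprocs small ones. */
    if (rank == 0) {
        FILE *f = fopen("snapshot.dat", "wb");
        fwrite(all, sizeof(double), (size_t)nprocs * CHUNK, f);
        fclose(f);
        free(all);
    }

    free(mine);
    MPI_Finalize();
    return 0;
}
```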
8 Common parallel I/O patterns
- Typical I/O access patterns revealed in a single run
- Reading a small amount of initial data for application computations
- Mostly collective, sometimes non-collective
- Generating a large amount of intermediate computational results for post-processing
- Mostly collective, may involve reorganization
- Reading intermediate results
- Mostly collective, may involve reorganization
- Data analysis and visualization
- Collective/non-collective
9 Common data types
- 2D/3D dense/sparse multidimensional arrays
- Each array element of fixed length
- Fine/coarse grained data distributions
- HPF-style cyclic and block distributions spread data evenly across processors
- AMR-style distributions don't
- Affects communication overhead and load balance during I/O (see the sketch below)
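To make the distribution styles concrete, the sketch below shows which processor owns each element of a 1-D array under block and cyclic distributions; the block case maps to one contiguous slab (one large write) per processor, while the cyclic case interleaves elements and forces small, strided accesses. The array size and processor count are illustrative.

```c
/* Sketch: element ownership under HPF-style BLOCK and CYCLIC
 * distributions of a 1-D array.  N and P are illustrative. */
#include <stdio.h>

int main(void)
{
    const int N = 16, P = 4;
    const int blk = (N + P - 1) / P;      /* block size, rounded up */

    for (int i = 0; i < N; i++) {
        int block_owner  = i / blk;       /* contiguous slabs: one large write each   */
        int cyclic_owner = i % P;         /* round-robin: many small strided accesses */
        printf("element %2d: BLOCK -> P%d, CYCLIC -> P%d\n",
               i, block_owner, cyclic_owner);
    }
    return 0;
}
```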
10 I/O interfaces on parallel platforms
- Unix-like file system interfaces
- Pleasantly familiar
- Don't take the parallel environment into account
- Performance will be poor unless you have a library to help you (see the sketch below)
- System-dependent interfaces
- Offer many different I/O modes and options
- May offer high performance
- Your I/O code will not be portable
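The Unix-like pattern mentioned above typically looks like the following sketch: each process opens the shared file, seeks to its own offset, and writes its slab independently. It is functionally correct, but the file system sees many uncoordinated seeks and partial-block writes, which is why performance is poor without a library to reorganize the requests. The file name and slab size are illustrative.

```c
/* Sketch of the 'pleasantly familiar' Unix-style approach: each MPI
 * process seeks to its own offset in a shared file and writes its slab
 * independently.  Functionally fine, but the file system sees many
 * uncoordinated seeks and partial-block writes.  The file name and
 * slab size are illustrative. */
#include <mpi.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>
#include <stdlib.h>

#define SLAB (4 * 1024 * 1024)   /* bytes per process (illustrative) */

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char *slab = malloc(SLAB);
    /* ... fill 'slab' with this process's part of the output ... */

    int fd = open("output.dat", O_WRONLY | O_CREAT, 0644);
    lseek(fd, (off_t)rank * SLAB, SEEK_SET);   /* seek to my region */
    write(fd, slab, SLAB);                     /* independent, uncoordinated write */
    close(fd);

    free(slab);
    MPI_Finalize();
    return 0;
}
```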
11 I/O interface problem
12 Case studies of current I/O systems
- Examples
- IBM SP
- Intel Paragon
- Origin 2000
- Cray T3E
- Workstation clusters
13 Workstation clusters
- E.g., an FDDI-connected HP workstation cluster with a small number of nodes
- I/O support on the HP workstation cluster
- HP-UX file system
- Fast access to local disk
- No shared file system
14 Parallel I/O approaches for clusters
15 Parallel I/O approaches for clusters
(diagram: Network)
16 Multiple independent I/O nodes
- If each node sends its data to its local disk (sketched below)
- Fast
- Many files
- Non-canonical input/output format
- Need tools to help with data loading, migration,
postprocessing
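A minimal sketch of the one-file-per-node approach: each process dumps its piece to its own local file, which is fast and embarrassingly parallel but leaves the data in a non-canonical, many-file layout that later tools must reassemble. The scratch path and rank-based naming scheme are illustrative assumptions.

```c
/* Sketch: each node dumps its piece to its own local disk.  Fast and
 * embarrassingly parallel, but it leaves one file per process in a
 * non-canonical layout.  The scratch path and naming scheme are
 * illustrative. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define PIECE (1 << 20)   /* bytes per process (illustrative) */

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char *piece = malloc(PIECE);
    /* ... fill 'piece' with this node's part of the data set ... */

    char path[256];
    snprintf(path, sizeof path, "/local/scratch/dump.%04d", rank);

    FILE *f = fopen(path, "wb");   /* local disk: no contention with other nodes */
    fwrite(piece, 1, PIECE, f);
    fclose(f);

    free(piece);
    MPI_Finalize();
    return 0;
}
```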
17 Multiple independent I/O nodes
- If only some nodes act as I/O nodes
- Opens the possibility of canonical output formats (e.g., rearrange data during I/O, concatenate the resulting files)
- But if multiple compute nodes send I/O requests to the same I/O node at the same time, you will probably not get anywhere near peak disk performance, because seeks are expensive
- And the interconnect may be your bottleneck
- Conclusion: can be fast, but you need a library to help you
18 IBM SP system architecture
- High-performance switch-connected workstation cluster
- Each node is an RS/6000 workstation that can have a local disk
- I/O support on the IBM SP
- PIOFS parallel file system
- 2D file layouts, multiple access modes
- Very slow (1/2 of disk throughput), unreliable
- AIX JFS on each node's local disk
- Fast, but no shared file system
19 Parallel I/O on the SP
- If you don't use PIOFS, your options are the same as with the HP cluster
- And you probably don't want to use PIOFS
20 Origin 2000 system architecture
- A distributed shared memory and I/O system
- Each node consists of 2 CPUs, caches, memory, and a directory
- Interconnection of nodes: binary n-cube
- I/O support on 16-node Origin 2000 at NCSA
- 15 MB/sec sustained throughput for writes
- 2 SCSI-2 RAID adapters striped via the XFS file system volume manager, XLV (max 40 MB/s)
- 8 9-GB 7200-RPM disks per RAID
21 Your I/O options on the NCSA O2000
- You don't really have any. I/O will be a bottleneck for you if you save a lot of data
- Shared file systems
- Make it simple to create output in canonical format
- But will be very slow if each processor seeks to the right spot and does a small write
- Usually show little speedup if you add a library that supports multiple I/O nodes writing to the file system: no true parallelism with the shared file system. It doesn't have to be this way, but it is
- Expose you to the I/O costs of other applications
22 Intel Paragon system architecture
- Distributed memory platform
- I/O support on Intel Paragon
- Certain nodes are dedicated to I/O
- PFS parallel file system on the Caltech Paragon
- Can sustain 84 MB/sec, 512 compute nodes
- Multiple access modes (M_SYNC, M_UNIX, etc.) intended to provide high performance for different situations, but they mainly add complexity for users
- Data automatically striped across 92 I/O nodes
23 Cray T3E
- Distributed memory machine
- Separate compute and (system-controlled) I/O nodes
- I/O support on the Cray T3E at PSC
- Shared Cray Unix file system
- Separate channels for I/O and communication
- File system can support over 40 MB/sec sustained
with a single requester - The larger the write request, the better
- Additional requesters increase throughput only slightly
24 Your options on the Cray T3E
- Very hard to reach near-peak throughput
- Performance is very sensitive to the activities of other applications
- Small write requests are still a bad idea (a simple buffering sketch follows)
- Still, if you don't have an I/O library to help you, you will probably get faster I/O here than on other platforms (e.g., 20 MB/s with large write requests)
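One simple mitigation for the small-request problem, sketched below, is to stage output in a large in-memory buffer and flush it to the file system in multi-megabyte writes. The 8 MB buffer size and 512-byte record size are illustrative choices, not T3E-specific recommendations.

```c
/* Sketch: aggregate many small output records in memory and flush them
 * as large write requests, which get much closer to peak throughput.
 * The 8 MB buffer and 512-byte record are illustrative sizes; records
 * are assumed to be smaller than the buffer. */
#include <stdio.h>
#include <string.h>

#define BUFSZ (8 * 1024 * 1024)

static char   buf[BUFSZ];
static size_t used = 0;

/* Queue a small record; only touch the file when the buffer fills up. */
static void buffered_write(FILE *f, const void *rec, size_t len)
{
    if (used + len > BUFSZ) {      /* flush as one large request */
        fwrite(buf, 1, used, f);
        used = 0;
    }
    memcpy(buf + used, rec, len);
    used += len;
}

int main(void)
{
    FILE *f = fopen("timestep.dat", "wb");
    double record[64] = {0};           /* a small per-cell record (512 bytes) */

    for (int i = 0; i < 100000; i++)   /* many small logical writes */
        buffered_write(f, record, sizeof record);

    fwrite(buf, 1, used, f);           /* flush whatever is left */
    fclose(f);
    return 0;
}
```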
25 Summary of what you will find
- Simulations are often I/O intensive
- Poor parallel I/O system support out there
- Need parallel I/O solutions!
26 Understanding the parallel I/O problems
- Poor I/O performance
- Non-portable I/O code
- Complex parallel I/O systems
- Complex and changing I/O access patterns
27 Causes of the parallel I/O problems
- Poor I/O performance
- Unsuitable I/O interfaces
- Cannot capture application I/O semantics, e.g., "I want to write this distributed array"
- Conceptually simple I/O operations are transformed into inefficient and complex low-level I/O requests, with many seeks, buffering errors, and partial writes of disk blocks
- Full I/O parallelism lacking
28 Full I/O parallelism
- The I/O approach must scale up as the number of processors increases
- Shared file systems can become centralized bottlenecks
- Each I/O node should be writing at top speed: special support for collective I/O
- Requires careful load balancing
- Communication costs must also be balanced
29 Causes of the parallel I/O problems
- Non-portable I/O codes
- System dependent interfaces
- The MPI-IO Standard
- Portable interfaces
- High-performance implementations on different
platforms
30 MPI-IO
- I/O interface of the MPI-2 standard
- Goals
- Application portability
- I/O performance
- File interoperability
- Support common I/O access patterns (a minimal sketch follows)
- Support different storage device hierarchies
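A minimal sketch of how "write this distributed array" is expressed through MPI-IO: each process describes its local block of a global 2-D array with a subarray datatype, installs that as its file view, and all processes write collectively, so the array lands in canonical order in a single file. The 2x2 process grid, the 512x512 array, and the file name are illustrative assumptions.

```c
/* Sketch: writing a block-distributed 2-D array to a single file in
 * canonical order with MPI-IO.  The 2x2 process grid (run with exactly
 * 4 processes), the 512x512 global array, and the file name are
 * illustrative assumptions. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int gsizes[2] = {512, 512};                    /* global array shape */
    int psizes[2] = {2, 2};                        /* 2x2 process grid   */
    int lsizes[2] = {gsizes[0] / psizes[0],
                     gsizes[1] / psizes[1]};       /* local block shape  */
    int coords[2] = {rank / psizes[1], rank % psizes[1]};
    int starts[2] = {coords[0] * lsizes[0], coords[1] * lsizes[1]};

    /* Describe where this process's block sits inside the global array. */
    MPI_Datatype filetype;
    MPI_Type_create_subarray(2, gsizes, lsizes, starts,
                             MPI_ORDER_C, MPI_DOUBLE, &filetype);
    MPI_Type_commit(&filetype);

    double *local = malloc((size_t)lsizes[0] * lsizes[1] * sizeof(double));
    /* ... fill 'local' with this process's block of the array ... */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "array.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_set_view(fh, 0, MPI_DOUBLE, filetype, "native", MPI_INFO_NULL);

    /* Collective write: the library can reorganize data among processes
     * and issue a few large, well-formed requests to the disks. */
    MPI_File_write_all(fh, local, lsizes[0] * lsizes[1],
                       MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Type_free(&filetype);
    free(local);
    MPI_Finalize();
    return 0;
}
```

The collective write call is what gives an MPI-IO implementation room to apply strategies such as two-phase I/O behind a familiar-looking interface.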
31 Parallel I/O interface
(diagram: multiple applications on top of a portable parallel I/O interface, MPI-IO)
32 MPI-IO implementations
- ROMIO from ANL
- www.mcs.anl.gov/mpi/mpi2
- MPI-IO from LLNL
- www.llnl.gov/people/trj/goddard
- MPI-IO from IBM
- www.research.ibm.com/p/prost/sections/mpiio.html
- MPI-IO from NASA Ames
- parallel.nas.nasa.gov/MPI-IO/pmpio/pmpio.html
33 Causes of the parallel I/O problems
- Complex parallel I/O resources
- Many interdependent system modules
- Processors, memory, disks, tapes, interconnects, ...
- Many options to consider simultaneously
- File layouts, access modes
- Many performance factors
- Disk/file system utilization
- Communication system utilization
- Load-balancing
- Parallelism
34 Causes of the parallel I/O problems
- Complex and changing I/O access patterns
- Mixture of reads/writes
- Mixture of fine-grained/coarse-grained data distributions
- Trouble with balancing
- checkpoint/restart operations
- timestep output/data analysis and visualization operations
35 Our observations
- Parallel I/O systems need to provide
- Ease-of-use
- Simple I/O interfaces
- Automatic parallelism for I/O and data migration
- Application portability
- High-performance I/O strategies for a wide range of system conditions: automatic I/O strategy selection without human intervention
36 Parallel I/O strategies
(diagram: multiple applications on top of an automatic I/O strategy selection layer)
37 References
- Parallel I/O archive
- http://www.cs.dartmouth.edu/pario/bib
- Panda
- http://www.drl.cs.uiuc.edu/panda
38 The state-of-the-art parallel I/O system
(diagram: a rocket simulation uses a parallel I/O interface and parallel I/O clients, which talk over the network to parallel I/O servers handling timestep and checkpoint data on secondary and tertiary storage)